METHOD AND DEVICE FOR COUPLING A DATA PROCESSING
UNIT AND A DATA PROCESSING ARRAY

FIELD OF THE INVENTION 

The present invention relates to methods of operating, and making optimum use of,
reconfigurable arrays of data processing elements.

BACKGROUND INFORMATION

The limitations of conventional processors are becoming more and more evident. The growing
importance of stream-based applications makes coarse-grain dynamically reconfigurable
architectures an attractive alternative. See, e.g., R. Hartenstein, R. Kress, & H. Reinig, "A new
FPGA architecture for word-oriented datapaths," Proc. FPL'94, Springer LNCS, September
1994, at 849; E. Waingold et al., "Baring it all to software: Raw machines," IEEE Computer,
September 1997, at 86-93; PACT Corporation, "The XPP Communication System," Technical
Report 15 (2000); see generally the World Wide Web .com address of "pactcorp." Such
architectures combine the performance of ASICs, which are very risky and expensive
(development and mask costs), with the flexibility of traditional processors. See, for example,
J. Becker, "Configurable Systems-on-Chip (CSoC)," (Invited Tutorial), Proc. of 9th Proc. of XV
Brazilian Symposium on Integrated Circuit Design (SBCCI 2002), (September 2002).

The datapaths of modern microprocessors reach their limits by using static instruction sets. In
spite of the possibilities that exist today in VLSI development, the basic concepts of
microprocessor architectures are the same as 20 years ago. The main processing unit of modern
conventional microprocessors, the datapath, in its actual structure follows the same style
guidelines as its predecessors. Although the development of pipelined architectures or
superscalar concepts in combination with data and instruction caches increases the performance
of a modern microprocessor and allows higher frequency rates, the main concept of a static
datapath remains. Therefore, each operation is a composition of basic instructions that the
processor in use provides. The benefit of the processor concept lies in its ability to execute
strongly control-dominant applications. Data- or stream-oriented applications are not well
suited for this environment. Sequential instruction execution is not the right target for that
kind of application and needs high bandwidth because of the permanent retransmission of
instructions and data from and to memory. This handicap is often eased by the use of caches in
various stages. A sequential interconnection of filters, which perform data manipulation without
writing back the intermediate results, would achieve the right optimization and a reduction of
bandwidth. In practice, this kind of chain of filters should be constructed in a logical way and
configured during runtime. Existing approaches to extending instruction sets use static modules,
which are not modifiable during runtime.
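The bandwidth argument above can be illustrated with a small C sketch. The two filter stages and all function names are hypothetical and chosen only for illustration; the sketch contrasts a stage-by-stage execution that writes intermediate results back to memory with a chained execution that does not.

```c
#include <stddef.h>

/* Two hypothetical filter stages, chosen only for illustration. */
static int stage_scale(int x)  { return 3 * x; }
static int stage_offset(int x) { return x + 1; }

/* Sequential style: each stage writes its intermediate result back to
 * memory (tmp), so every element causes four memory accesses. */
void filter_with_writeback(const int *in, int *tmp, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = stage_scale(in[i]);
    for (size_t i = 0; i < n; i++)
        out[i] = stage_offset(tmp[i]);
}

/* Chained style: the stages are connected directly, so only the input
 * read and the final write touch memory. */
void filter_chained(const int *in, int *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = stage_offset(stage_scale(in[i]));
}
```

Both functions compute the same result; the chained variant simply avoids the round trip of the intermediate values through memory.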

Customized microprocessors or ASICs are optimized for one special application environment.
It is nearly impossible to use the same microprocessor core for another application without
losing the performance gain of this architecture.

A new approach of a flexible and high performance datapath concept is needed, which allows 
for reconfiguring the functionality and for making this core mainly application independent 
without losing the performance needed for stream-based applications. 


When using a reconfigurable array, it is desirable to optimize the way in which the array is 
coupled to other units, e.g., to a processor if the array is used as a coprocessor. It is also
desirable to optimize the way in which the array is configured. 

Further, WO 00/49496 discusses a method for execution of a computer program using a
processor that includes a configurable functional unit capable of executing reconfigurable
instructions, which can be redefined at runtime. A problem with conventional processor
architectures exists if a coupling of, for example, sequential processors is needed and/or
technologies such as data-streaming, hyper-threading, multi-threading, multi-tasking,
execution of parts of configurations, etc., are to be a useful way for enhancing performance.
Techniques discussed in the prior art, such as WO 02/50665 A1, do not allow for a sufficiently
efficient way of providing for a data exchange between the ALU of a CPU and the configurable
data processing logic cell field, such as an FPGA, DSP, or other such arrangement. In the prior
art, the data exchange is effected via registers. In other words, it is necessary to first write
data into a register sequentially, then retrieve them sequentially, and restore them sequentially
as well.

Another problem exists if an external access to data is requested in known devices used, inter
alia, to implement functions in the configurable data processing logic cell field, DFP, FPGA,
etc., that cannot be processed sufficiently on a CPU-integrated ALU. Accordingly, the data
processing logic cell field is practically used to allow for user-defined opcodes that can process
data more efficiently than is possible on the ALU of the CPU without further support by the
data processing logic cell field. In the prior art, the coupling is generally word-based, not
block-based. More efficient data processing, in particular more efficient than is possible with a
close coupling via registers, is highly desirable.

Another method for the use of logic cell fields that include coarse- and/or fine-granular logic
cells and logic cell elements provides for a very loose coupling of such a field to a conventional
CPU and/or a CPU-core in embedded systems. In this regard, a conventional sequential
program can be executed on the CPU, for example a program written in C, C++, etc., wherein
the instantiation or the data stream processing by the fine- and/or coarse-granular data
processing logic cell field is effected via that sequential program. However, a problem exists in
that for programming said logic cell field, a program not written in C or another sequential
high-level language must be provided for the data stream processing. It is desirable to allow for
C-programs to run both on a conventional CPU-architecture as well as on the data processing
logic cell field operated therewith, in particular, despite the fact that a quasi-sequential program
execution should maintain the capability of data-streaming in the data processing logic cell
fields, whereas simultaneously the capability exists to operate the CPU in a not too loosely
coupled way.

It is already known to provide for sequential data processing within a data processing logic cell
field. See, for example, DE 196 51 075, WO 98/26356, DE 196 54 846, WO 98/29952, DE 197
04 728, WO 98/35299, DE 199 26 538, WO 00/77652, and DE 102 12 621. Partial execution is
achieved within a single configuration, for example, to reduce the amount of resources needed,
to optimize the time of execution, etc. However, this does not lead automatically to allowing a
programmer to translate or transfer high-level language code automatically onto a data
processing logic cell field as is the case in common machine models for sequential processes.
The compilation, transfer, or translation of a high-level language code onto data processing
logic cell fields according to the methods known for models of sequentially executing machines
is difficult.

In the prior art, it is further known that configurations that effect different functions on parts of
the area respectively can be simultaneously executed on the processing array and that a change
of one or some of the configuration(s) without disturbing other configurations is possible at
run-time. Methods and hardware-implemented means for the implementation are known to ensure
that the execution of partial configurations to be loaded onto the array is possible without
deadlock. Reference is made to DE 196 54 593, WO 98/31102, DE 198 07 872, WO 99/44147,
DE 199 26 538, WO 00/77652, DE 100 28 397, and WO 02/13000. This technology allows in a
certain way a certain parallelism and, given certain forms and interrelations of the
configurations or partial configurations, a certain way of multitasking/multi-threading, in
particular in such a way that the planning, i.e., the scheduling and/or the planning control for
time use, can be provided for. Furthermore, from the prior art, time use planning control means
and methods are known that, at least under a corresponding interrelation of configurations
and/or assignment of configurations to certain tasks and/or threads to configurations and/or
sequences of configurations, allow for a multi-tasking and/or multi-threading.

SUMMARY OF THE INVENTION

Embodiments of the present invention may improve upon the prior art with respect to
optimization of the way in which a reconfigurable array is coupled to other units and/or the way
in which the array is configured.

A way out of the limitations of conventional microprocessors may be a dynamic reconfigurable
processor datapath extension achieved by integrating traditional static datapaths with the
coarse-grain dynamic reconfigurable XPP architecture (eXtreme Processing Platform).
Embodiments of the present invention introduce a new concept of loosely coupled
implementation of the dynamic reconfigurable XPP architecture from PACT Corp. into a static
datapath of the SPARC-compatible LEON processor. Thus, this approach is different from those
where the XPP operates as a completely separate (master) component within one Configurable
System-on-Chip (CSoC), together with a processor core, global/local memory topologies, and
efficient multi-layer AMBA-bus interfaces. See, for example, J. Becker & M. Vorbach,
"Architecture, Memory and Interface Technology Integration of an Industrial/Academic
Configurable System-on-Chip (CSoC)," IEEE Computer Society Annual Workshop on VLSI
(WVLSI 2003), (February 2003). From the programmer's point of view, the extended and adapted
datapath may seem like a dynamic configurable instruction set. It can be customized for a
specific application and can accelerate the execution enormously. To this end, the programmer
has to create a number of configurations that can be uploaded to the XPP-Array at run time.
For example, such a configuration can be used like a filter to calculate stream-oriented data. It
is also possible to configure more than one function at the same time and use them
simultaneously. These embodiments may provide an enormous performance boost and the needed
flexibility and power reduction to perform a series of applications very effectively.

Embodiments of the present invention may provide a hardware framework, which may enable
an efficient integration of a PACT XPP core into a standard RISC processor architecture.

Embodiments of the present invention may provide a compiler for a coupled RISC + XPP
hardware. The compiler may decide automatically which part of a source code is executed on
the RISC processor and which part is executed on the PACT XPP core.

In an example embodiment of the present invention, a C Compiler may be used in cooperation
with the hardware framework for the integration of the PACT XPP core and RISC processor.

In an example embodiment of the present invention, the proposed hardware framework may
accelerate the XPP core in two respects. First, data throughput may be increased by raising the
XPP's internal operating frequency into the range of the RISC's frequency. This, however, may
cause the XPP to run into the same pitfall as all high frequency processors, i.e., memory
accesses may become very slow compared to processor internal computations. Accordingly, a
cache may be provided for use. The cache may ease the memory access problem for a large range
of algorithms, which are well suited for an execution on the XPP. The cache, as a second
throughput increasing feature, may require a controller. A programmable cache controller may
be provided for managing the cache contents and feeding the XPP core. It may decouple the
XPP core computations from the data transfer so that, for instance, data preload to a specific
cache sector may take place while the XPP is operating on data located in a different cache
sector.
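The decoupling of computation and data transfer described above resembles a ping-pong (double-buffering) scheme. The following C sketch is a purely sequential software model of that ordering; the sector size, the summing "configuration", and all function names are assumptions made for illustration and do not describe the actual cache controller.

```c
#include <stddef.h>
#include <string.h>

#define SECTOR_WORDS 4  /* hypothetical sector size, for illustration only */

/* Stand-in for the cache controller's preload: copies up to one sector of
 * data from main memory into a sector buffer and returns the word count. */
static size_t preload_sector(int *sector, const int *mem, size_t pos, size_t n)
{
    size_t count = (n - pos < SECTOR_WORDS) ? n - pos : SECTOR_WORDS;
    memcpy(sector, mem + pos, count * sizeof *sector);
    return count;
}

/* Stand-in for an XPP configuration: sums the words of one sector. */
static long compute_sector(const int *sector, size_t count)
{
    long sum = 0;
    for (size_t i = 0; i < count; i++)
        sum += sector[i];
    return sum;
}

/* Ping-pong over two sectors: the preload for the next sector is issued
 * before the current sector is consumed. In hardware the two steps could
 * overlap in time; this sequential model only shows the ordering and the
 * roles of the two buffers. */
long process_stream(const int *mem, size_t n)
{
    int sectors[2][SECTOR_WORDS];
    size_t counts[2] = {0, 0};
    long total = 0;
    size_t pos = 0;
    int cur = 0;

    counts[cur] = preload_sector(sectors[cur], mem, pos, n);
    pos += counts[cur];
    while (counts[cur] > 0) {
        int next = 1 - cur;
        counts[next] = preload_sector(sectors[next], mem, pos, n);
        pos += counts[next];
        total += compute_sector(sectors[cur], counts[cur]);
        cur = next;
    }
    return total;
}
```

Calling `process_stream` on an array walks it sector by sector, always with one sector being consumed and the other already filled for the following step.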

A problem which may emerge with a coupled RISC+XPP hardware concerns the RISC's
multitasking concept. It may become necessary to interrupt computations on the XPP in order
to perform a task switch. Embodiments of the present invention may provide for hardware and
a compiler that support multitasking. First, each XPP configuration may be considered as an
uninterruptible entity. This means that the compiler, which generates the configurations, may
take care that the execution time of any configuration does not exceed a predefined time slice.
Second, the cache controller may be concerned with the saving and restoring of the XPP's state
after an interrupt. The proposed cache concept may minimize the memory traffic for interrupt
handling and frequently may even allow avoiding memory accesses altogether.
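To make the time-slice constraint concrete, the following C sketch estimates how many loop iterations could be packed into one uninterruptible configuration. The linear cost model (a fixed setup cost plus a constant cost per iteration) and all names are assumptions for illustration only; the text does not specify how the compiler estimates execution time.

```c
#include <stddef.h>

/* Largest iteration count whose estimated run time fits into one time
 * slice. Returns 0 if not even one iteration fits. */
size_t max_iters_per_config(unsigned long slice_cycles,
                            unsigned long setup_cycles,
                            unsigned long cycles_per_iter)
{
    if (cycles_per_iter == 0 || slice_cycles < setup_cycles)
        return 0;
    return (size_t)((slice_cycles - setup_cycles) / cycles_per_iter);
}

/* Number of uninterruptible configurations needed to cover total_iters
 * iterations; returns 0 if the loop cannot be split to fit. */
size_t configs_needed(size_t total_iters, size_t iters_per_config)
{
    if (iters_per_config == 0)
        return 0;
    return (total_iters + iters_per_config - 1) / iters_per_config;
}
```

Under this model, a loop that is too long for one time slice would be split into several shorter configurations, each of which can run to completion before a task switch.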

In an example embodiment of the present invention, the cache concept may be based on a
simple internal RAM ("IRAM") cell structure allowing for an easy scalability of the hardware.
For instance, extending the XPP cache size may require not much more than the
duplication of IRAM cells.

In an embodiment of the present invention, a compiler for a RISC + XPP system may provide
for compilation for the RISC + XPP system of real world applications written in the C
language. The compiler may remove the necessity of developing NML (Native Mapping
Language) code for the XPP by hand. It may be possible, instead, to implement algorithms in
the C language or to directly use existing C applications without much adaptation to the XPP
system. The compiler may include the following three major components to perform the
compilation process for the XPP:

1. partitioning of the C source code into RISC and XPP parts;

2. transformations to optimize the code for the XPP; and

3. generation of NML code.

The generated NML code may be placed and routed for the XPP.

The partitioning component of the compiler may decide which parts of an application code can
be executed on the XPP and which parts are executed on the RISC. Typical candidates for
becoming XPP code may be loops with a large number of iterations whose loop bodies are
dominated by arithmetic operations. The remaining source code - including the data transfer
code - may be compiled for the RISC.
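The partitioning criterion stated above (many iterations, loop body dominated by arithmetic) can be sketched as a simple predicate. The loop summary structure, the threshold, and the function name below are hypothetical and are not taken from the described compiler; they only illustrate the kind of decision involved.

```c
#include <stdbool.h>

/* Hypothetical per-loop summary a partitioner might collect. */
struct loop_info {
    unsigned long trip_count;  /* estimated number of iterations */
    unsigned arith_ops;        /* arithmetic operations in the loop body */
    unsigned other_ops;        /* calls, I/O, irregular control flow, ... */
};

/* Illustrative threshold; not taken from the described compiler. */
#define MIN_TRIP_COUNT 64UL

/* Returns true if the loop looks like a candidate for the XPP part. */
bool is_xpp_candidate(const struct loop_info *l)
{
    if (l->trip_count < MIN_TRIP_COUNT)
        return false;                    /* too few iterations to pay off */
    return l->arith_ops > l->other_ops;  /* body dominated by arithmetic */
}
```

Loops rejected by such a predicate, together with the data transfer code, would remain in the RISC part.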

The compiler may transform the XPP code such that it is optimized for NML code generation.
The transformations included in the compiler may include a large number of loop
transformations as well as general code transformations. Together with data and code analysis,
the compiler may restructure the code so that it fits into the XPP array and so that the final
performance may exceed the pure RISC performance. The compiler may generate NML code
from the transformed program. The whole compilation process may be controlled by an
optimization driver which selects the optimal order of transformations based on the source code.

Discussed below with respect to embodiments of the present invention are case studies,
selected on the guiding principle that each example may stand for a set of typical real-world
applications. For each example, the work of the compiler according to an embodiment of the
present invention is demonstrated. For example, first, partitioning of the code is discussed. The
code transformations, which may be done by the compiler, are shown and explained. Some
examples require minor source code transformations which may be performed by hand. These
transformations may be either too expensive or too specific to make sense to be included in the
proposed compiler. Dataflow graphs of the transformed codes are constructed for each example,
which may be used by the compiler to generate the NML code. In addition, the XPP resource
usages are shown. The case studies demonstrate that a compiler containing the proposed
transformations can generate efficient code from numerical applications for the XPP. This is
possible because the compiler may rely on the features of the suggested hardware, like the
cache controller.


Other embodiments of the present invention pertain to a realization that for data-streaming
data processing, block-based coupling is highly preferable. This is in contrast to the word-based
coupling discussed above with respect to the prior art.

Further, embodiments of the present invention provide for the use of time use planning control
means, discussed above with respect to their use in the prior art, for configuring and
managing configurations for the purpose of scheduling tasks, threads, and multi- and
hyper-threads.


BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 illustrates a memory hierarchy of the XPP core and the RISC core using a special cache
controller.

Fig. 2 illustrates an IRAM and configuration cache controller data structures and usage
example.

Fig. 3 illustrates an asynchronous pipeline of the XPP.

Fig. 4 illustrates a diagram of state transitions for the XPP cache controller.

Fig. 5 illustrates a memory hierarchy of the XPP core and the RISC core using a
special cache controller with added simultaneous multithreading.

Fig. 6 illustrates a cache structure example.

Fig. 7 illustrates a control-flow graph of a piece of a program.

Fig. 8 illustrates an example of control-flow sensitivity.

Fig. 9 illustrates an example of alignment analysis.

Fig. 10 illustrates an example for array merging.

Fig. 11 illustrates a global view of the compiling process.

Fig. 12 illustrates a detailed architecture of the XPP compiler.

Fig. 13 illustrates a detailed view of the XPP loop optimization.

Fig. 14 illustrates implementations of converter modules.

Fig. 15 illustrates an inner loop calculation dataflow graph.

Fig. 16 illustrates input preparation with shift register synthesis.

Fig. 17 illustrates an example of loop tiling.

Fig. 18 illustrates a dataflow graph representing the loop body.

Fig. 19 illustrates a dataflow graph representing the inner loop.

Fig. 20 illustrates overlaps of different iterations.

Fig. 21 illustrates visualized array access sequences.

Fig. 22 illustrates visualized array access sequences after optimization.

Fig. 23 illustrates a dataflow graph of matrix multiplication after unroll-and-jam.

Fig. 24 illustrates a dataflow graph of a butterfly loop.

Fig. 25 illustrates a modified dataflow graph, in which unrolling and splitting have been omitted
for simplicity.

Fig. 26 illustrates a dataflow graph of an MPEG inverse quantization for intra coded blocks.

Fig. 27 illustrates an idct function.

Fig. 28 illustrates an example implementation for saturate(val, n) as NML schematic using two
ALUs.

Fig. 29 illustrates an example of a pipeline.

Fig. 30 illustrates a dataflow graph of idct column processing.

Fig. 31 illustrates data layout transformations in idct configurations.

Fig. 32 illustrates a dataflow graph of an innermost loop nest.

Fig. 33 illustrates functions of an RDFP.

Fig. 34 illustrates a CDFG with two ALUs.

Fig. 35 illustrates a resulting CDFG.

Fig. 36 illustrates a resulting CDFG transformed from two read accesses shown in Fig. 44.

Fig. 37 illustrates a final CDFG transformed from a single access shown in Fig. 45.

Fig. 38 illustrates a final CDFG of an example with three read accesses.

Fig. 39 illustrates a generated CDFG for an example for loop.

Fig. 40 illustrates a general conditional statement template.

Fig. 41 illustrates a while loop template.

Fig. 42 illustrates a for loop template.

Fig. 43 illustrates all accesses to the same RAM combined and substituted by a single RAM
function.

Fig. 44 illustrates an intermediate CDFG with two read accesses.

Fig. 45 illustrates an example of a write access.

Fig. 46 illustrates an optimized version of the example of Figs. 36 and 44 using the ESEQ-
method.

Fig. 47 illustrates an intermediate CDFG generated before the array access Phase 2
transformation is applied.

Fig. 48 illustrates a final CDFG after Phase 2 transformation is applied.

Fig. 49 illustrates a LEON architecture overview.

Fig. 50 illustrates a LEON pipelined datapath structure.

Fig. 51 illustrates a structure of an XPP device.

Fig. 52 illustrates an extended datapath overview.

Fig. 53 illustrates a LEON-to-XPP dual-clock FIFO.

Fig. 54 illustrates an example of an extended LEON instruction pipeline.

Fig. 55 illustrates a computation time of IDCT (8x8).

Fig. 56 illustrates an MPEG-4 decoder block diagram.

Fig. 57 illustrates another example of an extended LEON instruction pipeline.

DETAILED DESCRIPTION OF THE INVENTION

2 HARDWARE

2.1 Design Parameter Changes

[0001] For integration of the XPP core as a functional unit into a standard RISC core, some
system parameters may be reconsidered as follows:

2.1.1 Pipelining / Concurrency / Synchronicity

[0002] RISC instructions of totally different type (Ld/St, ALU, Mul/Div/MAC, FPALU,
FPMul, etc.) may be executed in separate specialized functional units to increase the
fraction of silicon that is busy on average. Such functional unit separation has led to superscalar
RISC designs that exploit higher levels of parallelism.

[0003] Each functional unit of a RISC core may be highly pipelined to improve throughput.
Pipelining may overlap the execution of several instructions by splitting them into
unrelated phases, which may be executed in different stages of the pipeline. Thus, different
stages of consecutive instructions can be executed in parallel with each stage taking much less
time to execute. This may allow higher core frequencies.
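The effect of overlapping pipeline stages can be quantified with the standard idealized cycle-count model. The following C sketch assumes a stall-free pipeline in which every stage takes exactly one cycle; this is a textbook simplification and is not taken from the present description, and the function names are hypothetical.

```c
/* Cycles for n instructions when each passes through all k stages before
 * the next one starts (no overlap). */
unsigned long cycles_unpipelined(unsigned long n, unsigned long k)
{
    return n * k;
}

/* Cycles for n instructions in an ideal k-stage pipeline without stalls:
 * k cycles until the first instruction completes, then one instruction
 * completes per cycle. */
unsigned long cycles_pipelined(unsigned long n, unsigned long k)
{
    return (n == 0) ? 0 : k + n - 1;
}
```

For long instruction sequences the pipelined count approaches one instruction per cycle, and because each stage does less work per cycle, the cycle itself can be shorter.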

[0004] With an approximate subdivision of the pipelines of all functional units into
sub-operations of the same size (execution time), these functional units / pipelines may execute
in a highly synchronous manner with complex floating point pipelines being the exception.






MARKED-UP VERSION OF THE 
SUBSTITUTE SPECIFICATION 



[0005] Since the XPP core uses data flow computation, it is pipelined by design.
However, a single configuration usually implements a loop of the application, so the
configuration remains active for many cycles, unlike the instructions in every other functional
unit, which typically execute for one or two cycles at most. Therefore, it is still worthwhile to
consider the separation of several phases (e.g., Ld / Ex / Store) of an XPP configuration
(i.e., an XPP instruction) into several functional units to improve concurrency via pipelining
on this coarser scale. This also may improve throughput and response time in
conjunction with multitasking operations and implementations of simultaneous multithreading
(SMT).

[0006] The multi-cycle execution time may also forbid a strongly synchronous execution
scheme and may rather lead to an asynchronous scheme, e.g., like for floating point
square root units. This in turn may necessitate the existence of explicit
synchronization instructions.

2.1.2 Core Frequency / Memory Hierarchy

[0007] As a functional unit, the XPP's operating frequency may either be half of the core
frequency or equal to the core frequency of the RISC. Almost every RISC core currently on the
market has a core frequency that exceeds its memory bus frequency by a large factor.
Therefore, caches are employed, forming what is commonly called the memory hierarchy, where
each layer of cache is larger but slower than its predecessors.

[0008] This memory hierarchy does not help to speed up computations which shuffle large
amounts of data with little or no data reuse. These computations are called "bounded by
memory bandwidth." However, other types of computations with more data locality (another
term for data reuse) may gain performance as long as they fit into one of the upper layers
of the memory hierarchy. This is the class of applications that gains the highest speedups
when a memory hierarchy is introduced.

[0009] Classical vectorization can be used to transform memory-bounded algorithms with a
data set too big to fit into the upper layers of the memory hierarchy. Rewriting the code to
reuse smaller data sets sooner exposes memory reuse on a smaller scale. As the new data set
size is chosen to fit into the caches of the memory hierarchy, the algorithm is not memory
bounded anymore, yielding significant speed-ups.
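One common instance of rewriting code to reuse smaller data sets sooner is loop blocking (tiling), shown here for matrix multiplication. The choice of this transformation as an example, the block size `TILE`, and the function names are assumptions made for illustration; the paragraph above does not prescribe a particular rewriting.

```c
#include <stddef.h>

#define TILE 4  /* hypothetical block size, chosen so a block fits in cache */

/* C += A * B for n x n row-major matrices, straightforward loop order. */
void matmul_plain(const double *a, const double *b, double *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

/* Same computation, restructured so that each TILE x TILE block of the
 * operands is reused while it is still in the upper memory layers. */
void matmul_blocked(const double *a, const double *b, double *c, size_t n)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t jj = 0; jj < n; jj += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t i = ii; i < n && i < ii + TILE; i++)
                    for (size_t j = jj; j < n && j < jj + TILE; j++)
                        for (size_t k = kk; k < n && k < kk + TILE; k++)
                            c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

Both variants perform the same multiply-accumulate operations; only the order in which the data is visited changes, which is what allows the working set of the inner loops to stay within a cache level.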

2.1.3 Software / Multitasking Operating Systems

[0010] As the XPP is introduced into a RISC core, the changed environment (higher frequency
and the memory hierarchy) may necessitate not only reconsideration of hardware
design parameters, but also a reevaluation of the software environment.

Memory Hierarchy 

[0011] The introduction of a memory hierarchy may enhance the set of applications
that can be implemented efficiently. So far, the XPP has mostly been used for algorithms that
read their data sets in a linear manner, applying some calculations in a pipelined fashion and
writing the data back to memory. As long as all of the computation fits into the XPP array,
these algorithms are memory bounded. Typical applications are filtering and audio signal
processing in general.

[0012] But there is another set of algorithms that have even higher computational complexity
and higher memory bandwidth requirements. Examples are picture and video processing,
where a second and third dimension of data coherence opens up. This coherence is, e.g.,
exploited by picture and video compression algorithms that scan pictures in both dimensions to
find similarities, even searching consecutive pictures of a video stream for analogies.
These algorithms have a much higher algorithmic complexity as well as higher
memory requirements. Yet they are data local, either by design or by transformation, thus
efficiently exploiting the memory hierarchy and the higher clock frequencies of processors with
memory hierarchies.


Multi Tasking 

[0013] The introduction into a standard RISC core makes it necessary to understand and support
the needs of a multitasking operating system, as standard RISC processors are usually operated
in multitasking environments. With multitasking, the operating system may switch the
executed application on a regular basis, thus simulating concurrent execution of several
applications (tasks). To switch tasks, the operating system may have to save the state
(e.g., the contents of all registers) of the running task and then reload the state of another task.
Hence, it may be necessary to determine what the state of the processor is, and to keep it as
small as possible to allow efficient context switches.



[0014] Modern microprocessors gain their performance from multiple specialized and deeply
pipelined functional units and high memory hierarchies, enabling high core frequencies. But
high memory hierarchies mean that there is a high penalty for cache misses due to the 
difference between core and memory frequency. Many core cycles may pass until the values 
are finally available from memory. Deep pipelines incur pipeline stalls due to data 



Specialized functional units like floating point units idle for integer-only programs. For these
reasons, average functional unit utilization is much too low. 

[0015] The newest development with RISC processors, Simultaneous MultiThreading (SMT),
adds hardware support for a finer granularity (instruction / functional unit level) switching of
tasks, exposing more than one independent instruction stream to be executed. Thus, whenever
one instruction stream stalls or does not utilize all functional units, the other one can jump in.
This improves functional unit utilization for today's processors.

[0016] With SMT, the task (process) switching is done in hardware, so the processor state has
to be duplicated in hardware. So again it is most efficient to keep the state as small as possible.
For the combination of the PACT XPP and a standard RISC processor, SMT may be very
beneficial, since the XPP configurations may execute longer than the average RISC instruction.
Thus, another task can utilize the other functional units while a configuration is running. On
the other hand, not every task will utilize the XPP, so while one such non-XPP task is
running, another one will be able to use the XPP core.

2.2 Communication Between the RISC Core and the XPP Core

[0017] The following are several possible embodiments that are each a
possible hardware implementation for accessing memory.

2.2.1 Streaming 

[0018] Since streaming can only support (number_of_IO_ports * width_of_IO_port) bits per
cycle, it may be well suited for only small XPP arrays with heavily pipelined
configurations that feature few inputs and outputs. As the pipelines take a long time to fill and
empty while the running time of a configuration is limited (as described herein with
respect to "context switches"), this type of communication does not scale well to bigger XPP
arrays and XPP frequencies near the RISC core frequency.

[0019] Streaming from the RISC core

[0020] In this setup, the RISC may supply the XPP array with the streaming data. Since the
RISC core may have to execute several instructions to compute addresses and load an item
from memory, this setup is only suited if the XPP core is reading data with a frequency much
lower than the RISC core frequency.

[0021] Streaming via DMA

[0022] In this mode the RISC core only initializes a DMA channel which may then supply the
data items to the streaming port of the XPP core.

2.2.2 Shared Memory (Main Memory)

[0023] In this configuration, the XPP array configuration may use a number of PAEs to
generate an address that is used to access main memory through the IO ports. As the number of
IO ports may be very limited, this approach may suffer from the same limitations as the
previous one, although for larger XPP arrays there is less impact of using PAEs for address
generation. However, this approach may still be useful for loading values from
very sparse vectors.
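The port-limited bandwidth bound stated for streaming, (number_of_IO_ports * width_of_IO_port) bits per cycle, also constrains the main-memory approach, since both go through the IO ports. The following C sketch evaluates that bound; the function names and any concrete port counts or widths used with it are hypothetical and not taken from the description.

```c
/* Upper bound on transferred bits per cycle, as stated in the text:
 * number_of_IO_ports * width_of_IO_port. */
unsigned long stream_bits_per_cycle(unsigned long number_of_io_ports,
                                    unsigned long width_of_io_port)
{
    return number_of_io_ports * width_of_io_port;
}

/* Minimum number of cycles needed to move total_bits through the ports
 * (rounded up); returns 0 if there is no port bandwidth at all. */
unsigned long min_stream_cycles(unsigned long total_bits,
                                unsigned long number_of_io_ports,
                                unsigned long width_of_io_port)
{
    unsigned long per_cycle =
        stream_bits_per_cycle(number_of_io_ports, width_of_io_port);
    if (per_cycle == 0)
        return 0;
    return (total_bits + per_cycle - 1) / per_cycle;
}
```

Because the bound grows only with the number and width of the ports, not with the array size, a larger array cannot be fed any faster through the same ports.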



2.2.3 Shared Memory (IRAM)

[0024] This data access mechanism uses the IRAM elements to store data for local
computations. The IRAMs can either be viewed as vector registers or as local copies of main
memory.

[0025] The following are several ways in which to fill the IRAMs with data.

[0026] 1. The IRAMs may be loaded in advance by a separate configuration using streaming.

[0027] This method can be implemented with the current XPP architecture. The IRAMs act as
vector registers. As explicated above, this may limit the performance of the XPP array,
especially as the IRAMs will always be part of the externally visible state and hence must be
saved and restored on context switches.

[0028] 2. The IRAMs may be loaded in advance by separate load-instructions.

[0029] This is similar to the first method. Load-instructions that load the data into the
IRAMs may be implemented in hardware. The load-instructions can be viewed as a hard-coded
load configuration. Therefore, configuration reloads may be reduced. Additionally, the special
load instructions may use a wider interface to the memory hierarchy.

[0030] Therefore, a more efficient method than streaming can be used.

[0031] 3. The IRAMs can be loaded by a "burst preload from memory" instruction of the cache
controller. No configuration or load-instruction is needed on the XPP. The IRAM load may
be implemented in the cache controller and triggered by the RISC processor. But the IRAMs
may still act as vector registers and may therefore be included in the externally visible state.

[0032] 4. The best mode, however, may be a combination of the previous solutions with the
extension of a cache:

[0033] A preload instruction may map a specific memory area defined by starting address
and size to an IRAM. This may trigger a (delayed, low priority) burst load from the
memory hierarchy (cache). After all IRAMs are mapped, the next configuration can be
activated. The activation may incur a wait until all burst loads are completed. However,
if the preload instructions are issued long enough in advance and no interrupt or task switch
destroys cache locality, the wait will not consume any time.

[0034] To specify a memory block as output-only IRAM, a "preload clean" instruction may
be used, which may avoid loading data from memory. The "preload clean" instruction
just indicates the IRAM for write-back.

[0035] A synchronization instruction may be needed to make sure that the content of a
specific memory area, which is cached in IRAM, is written back to the memory hierarchy. This
can be done globally (full write-back), or selectively by specifying the memory area which
will be accessed.

2.3 State of the XPP Core

[0036] As discussed above, the size of the state may be crucial for the efficiency of context
switches. However, although the size of the state may be fixed for the XPP core, whether or not
the various state elements have to be saved may depend on their declaration.

[0037] The state of the XPP core can be classified as:

[0038] 1. Read-only (instruction data)

[0039] * configuration data, consisting of PAE configuration and routing configuration data; and

[0040] 2. Read-write

[0041] * the contents of the data registers and latches of the PAEs, which are driven onto the busses

[0042] * the contents of the IRAM elements.

2.3.1 Limiting Memory Traffic

[0043] There are several possibilities to limit the amount of memory traffic during context
switches, as follows:



Do Not Save Read-Only Data
[0044] This may avoid storing configuration data, since configuration data is read-only.
The current configuration may be simply overwritten by the new one.



Save Less Data 

[0045] If a configuration is defined to be uninterruptible (non pre-emptive), all of the local state
on the busses and in the PAEs can be declared as scratch. This means that every configuration
may get its input data from the IRAMs and may write its output data to the IRAMs.
So after the configuration has finished, all information in the PAEs and on the busses may be
redundant or invalid, and saving of the information might not be required.



Save Modified Data Only

[0046] To reduce the amount of R/W data which has to be saved, the method may
keep track of the modification state of the different entities. This may incur a silicon area
penalty for the additional "dirty" bits.

Use Caching to Reduce the Memory Traffic
[0047] The configuration manager may handle manual preloading of configurations.
Preloading may help in parallelizing the memory transfers with other computations during
the task switch. This cache can also reduce the memory traffic for frequent context switches,
provided that a Least Recently Used (LRU) replacement strategy is implemented in addition to
the preload mechanism.

[0048] The IRAMs can be defined to be local cache copies of main memory, as
discussed above under the heading "Shared Memory (IRAM)."
Then each IRAM may be associated with a starting address and modification state
information. The IRAM memory cells may be replicated. An IRAM PAE may
contain an IRAM block with multiple IRAM instances. It may be that only the starting
addresses of the IRAMs have to be saved and restored as context. The starting addresses for the
IRAMs of the current configuration select the IRAM instances with identical addresses to be
used.

[0049] If no address tag of an IRAM instance matches the address of the newly loaded context,
the corresponding memory area may be loaded to an empty IRAM instance.

[0050] If no empty IRAM instance is available, a clean (unmodified) instance may be declared
empty (and hence it may be required for it to be reloaded later on).

[0051] If no clean IRAM instance is available, a modified (dirty) instance may be cleaned by
writing its data back to main memory. This may add a certain delay for the write-back.

[0052] This delay can be avoided if a separate state machine (cache controller) tries to clean
inactive IRAM instances by using unused memory cycles to write back the IRAM instances'
contents.
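
By way of illustration only, the selection order described above (matching address tag, then an empty instance, then a clean instance, then a dirty instance with write-back) may be sketched in C as follows. The type and function names (iram_instance, select_instance) and the flag outputs are hypothetical and serve only to model the described behavior; they are not part of the XPP hardware or its instruction set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one IRAM block holding several IRAM instances. */
typedef enum { IRAM_EMPTY, IRAM_CLEAN, IRAM_DIRTY } iram_state;

typedef struct {
    uint32_t   start;   /* address tag: start of the cached memory area */
    iram_state state;   /* modification state of this instance */
} iram_instance;

/*
 * Pick the instance to use for the memory area starting at `start`.
 * Order: tag match, then empty, then clean, then dirty.
 * *needs_load is set when the area has to be loaded from memory,
 * *needs_writeback when a dirty instance has to be cleaned first.
 * Returns the instance index, or -1 if the block has no instances.
 */
static int select_instance(const iram_instance *inst, size_t n, uint32_t start,
                           int *needs_load, int *needs_writeback)
{
    int empty = -1, clean = -1, dirty = -1;

    *needs_load = 0;
    *needs_writeback = 0;
    for (size_t i = 0; i < n; i++) {
        if (inst[i].state != IRAM_EMPTY && inst[i].start == start)
            return (int)i;                       /* tag match: reuse as is */
        if (inst[i].state == IRAM_EMPTY && empty < 0) empty = (int)i;
        if (inst[i].state == IRAM_CLEAN && clean < 0) clean = (int)i;
        if (inst[i].state == IRAM_DIRTY && dirty < 0) dirty = (int)i;
    }
    *needs_load = 1;
    if (empty >= 0) return empty;                /* load into an empty instance */
    if (clean >= 0) return clean;                /* declare a clean one empty */
    if (dirty >= 0) *needs_writeback = 1;        /* clean a dirty one first */
    return dirty;
}
```

In this sketch the write-back delay appears only when needs_writeback is set, which is the case a background cache controller would try to avoid by cleaning inactive instances in unused memory cycles.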

2.4 Context Switches

[0053] Usually a processor is viewed as executing a single stream of instructions. But today's
multi-tasking operating systems support hundreds of tasks being executed on a single
processor. This is achieved by switching contexts, where all, or at least the most relevant parts,
of the processor state which belong to the current task (the task's context) are exchanged with
the state of another task that will be executed next.



[0054] There are three types of context switches: switching of virtual processors with
simultaneous multithreading (SMT, also known as Hyper-Threading), execution of an Interrupt
Service Routine (ISR), and a Task Switch.

2.4.1 SMT Virtual Processor Switch

[0055] This type of context switch may be executed without software interaction, totally in
hardware. Instructions of several instruction streams are merged into a single instruction stream
to increase instruction level parallelism and improve functional unit utilization. Hence, the
processor state cannot be stored to and reloaded from memory between instructions from
different instruction streams. For example, in an instance of alternating
instructions from two streams, hundreds to thousands of cycles might be needed
to write the processor state to memory and read in another state.

[0056] Hence, hardware designers have to replicate the internal state for every virtual processor.
Every instruction may be executed within the context (on the state) of the virtual processor
whose program counter was used to fetch the instruction. By replicating the state, only the
multiplexers, which have to be inserted to select one of the different states, have to be switched.

[0057] Thus, the size of the state may also increase the silicon area needed to
implement SMT, so the size of the state may be crucial for many design decisions.

2.4.2 Interrupt Service Routine

[0058] This type of context switch may be handled partially by hardware and partially by
software. It may be required for all of the state modified by the ISR to be saved on
entry and to be restored on exit.

[0059] The part of the state which is destroyed by the jump to the ISR may be saved by
hardware (e.g., the program counter). It may be the ISR's responsibility to save and restore
the state of all other resources that are actually used within the ISR.

[0060] The more state information to be saved, the slower the interrupt response time may
be and the greater the performance impact may be if external events trigger interrupts at a
high rate.



[0061] The execution model of the instructions may also affect the tradeoff between short
interrupt latencies and maximum throughput. Throughput may be maximized if the
instructions in the pipeline are finished and the instructions of the ISR are chained. This may
adversely affect the interrupt latency. If, however, the instructions are abandoned
(pre-empted) in favor of a short interrupt latency, it may be required for
them to be fetched again later, which may affect throughput. The third possibility would
be to save the internal state of the instructions within the pipeline, but this may require
too much hardware effort. Usually this is not done.

2.4.3 Task Switch

[0062] This type of context switch may be executed totally in software. It may be
required for all of a task's context (state) to be saved to memory, and it may be required for
the context of the new task to be reloaded. Since tasks are usually allowed to use all of the
processor's resources to achieve top performance, it may be required to save and restore all of
the processor state. If the amount of state is excessive, it may be
required for the rate of context switches to be decreased by less frequent rescheduling, or a
severe throughput degradation may result, as most of the time may be spent in saving
and restoring task contexts. This in turn may increase the response time for the tasks.

2.5 A Load Store Architecture

[0063] In an example embodiment of the present invention, an XPP integration may
be provided as an asynchronously pipelined functional unit for the RISC. An explicitly
preloaded cache may be provided for the IRAMs, on top of the memory hierarchy
existing within the RISC (as discussed above under the heading "Shared Memory (IRAM)").
Additionally, a de-centralized explicitly preloaded
configuration cache within the PAE array may be employed to support preloading of
configurations and fast switching between configurations.

[0064] Since the IRAM content is an explicitly preloaded memory area, a virtually unlimited
number of such IRAMs can be used. They may be identified by their memory address and
their size. The IRAM content may be explicitly preloaded by the application. Caching
may increase performance by reusing data from the memory hierarchy. The cached
operation may also eliminate the need for explicit store instructions; they may be
handled implicitly by cache write-back operations but can also be forced
to synchronize with the RISC.

[0065] The pipeline stages of the XPP functional unit may be Load, Execute, and Write
Back (Store). The store may be executed delayed as a cache write-back. The pipeline stages
may execute in an asynchronous fashion, thus hiding the variable delays from the cache
preloads and the PAE array.

[0066] The XPP functional unit may be decoupled from the RISC by a FIFO, which is fed with
the XPP instructions. At the head of this FIFO, the XPP PAE may consume and
execute the configurations and the preloaded IRAMs. Synchronization of the XPP and
the RISC may be done explicitly by a synchronization instruction.

Instructions 

[0067] Embodiments of the present invention may require certain
instruction formats. Data types may be specified
using a C style prototype definition. The following are example instruction formats which
may be required, all of which execute asynchronously, except for an XPPSync instruction,
which can be used to force synchronization.

[0068] XPPPreloadConfig (void *ConfigurationStartAddress)

[0069] The configuration may be added to the preload FIFO to be loaded into the
configuration cache within the PAE array.

[0070] Note that speculative preloads are possible, since successive preload commands
overwrite the previous.

[0071] The parameter is a pointer register of the RISC pointer register file. The size is
implicitly contained in the configuration.

[0072] XPPPreload (int IRAM, void *StartAddress, int Size)

[0073] XPPPreloadClean (int IRAM, void *StartAddress, int Size)

[0074] This instruction may specify the contents of the IRAM for the next
configuration execution. In fact, the memory area may be added to the preload FIFO to be
loaded into the specified IRAM.

[0075] The first parameter may be the IRAM number. This may be an immediate (constant)
value.

[0076] The second parameter may be a pointer to the starting address. This parameter may
be provided in a pointer register of the RISC pointer register file.

[0077] The third parameter may be the size in units of 32-bit words. This may be an integer
value. It may reside in a general-purpose register of the RISC's integer register file.

[0078] The first variant may actually preload the data from memory.

[0079] The second variant may be for write-only accesses. It may skip the loading
operation. Thus, it may be that no cache misses can occur for this IRAM. Only the address and
size are defined. They are obviously needed for the write-back operation of the IRAM cache.

[0080] Note that speculative preloads are possible, since successive preload commands to the
same IRAM overwrite each other (if no configuration is executed in between). Thus, only the
last preload command may be actually effective when the configuration is executed.

[0081] XPPExecute ()

[0082] This instruction may execute the last preloaded configuration with the last
preloaded IRAM contents. Actually, a configuration start command may be issued to the
FIFO. Then the FIFO may be advanced. This may mean that further preload
commands will specify the next configuration or parameters for the next configuration.
Whenever a configuration finishes, the next one may be consumed from the head of the FIFO,
if its start command has already been issued.

[0083] XPPSync (void *StartAddress, int Size)

[0084] This instruction may force write-back operations for all IRAMs that overlap the
given memory area.

[0085] The first parameter is a pointer to the starting address. This parameter is provided in a
pointer register of the RISC pointer register file.

[0086] The second parameter is the size. This is an integer value. It resides in a general-purpose
register of the RISC's integer register file.

[0087] If overlapping IRAMs are still in use by a configuration or preloaded to be used, this
operation will block. Giving an address of NULL (zero) and a size of MAX_INT (bigger than
the actual memory), this instruction can also be used to wait until all issued configurations
finish.

[0088] Giving a size of zero can be used as a simple wait for the end of the configuration.
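
As a non-limiting sketch, the selection of IRAMs affected by a synchronization over a given memory area may be modeled in C by a range-overlap test. The names mem_area and areas_overlap are hypothetical, and the use of byte units for start and size is an assumption made for this sketch only, since the unit of the Size parameter of the synchronization instruction is not stated here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor of a memory area (for example, one mapped to an IRAM). */
typedef struct {
    uint64_t start;  /* address of the first byte (assumed unit: bytes) */
    uint64_t size;   /* length in bytes */
} mem_area;

/*
 * Returns 1 when the half-open ranges [a.start, a.start + a.size) and
 * [b.start, b.start + b.size) share at least one byte, else 0.
 * Assumes start + size does not wrap around in 64 bits.
 */
static int areas_overlap(mem_area a, mem_area b)
{
    if (a.size == 0 || b.size == 0)
        return 0;                 /* an empty area overlaps nothing */
    return a.start < b.start + b.size && b.start < a.start + a.size;
}
```

Under this model, an area starting at zero with a very large size overlaps every mapped IRAM, while an area of size zero overlaps none, so in the latter case only the waiting behavior of the instruction would remain.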

[0089] XppSave (void *StartAddress)

[0090] This instruction saves the task context of the XPP to the given memory area.

[0091] The parameter is a pointer to the starting address. This parameter is provided in a pointer
register of the RISC pointer register file.

[0092] The size depends on the actual implementation of the XPP. However, only the task
scheduler of the operating system will use this instruction. So this is a usual limitation.

[0093] XppRestore (void *StartAddress)

[0094] This instruction restores the task context of the XPP from the given memory area.

[0095] The parameter is a pointer to the starting address. This parameter is provided in a pointer
register of the RISC pointer register file.

[0096] The size depends on the actual implementation of the XPP. However, only the task
scheduler of the operating system will use this instruction. So this is a usual limitation.

2.5.1 A Basic Implementation

[0097] The XPP core shares the memory hierarchy with the RISC core using a special cache
controller (see Fig. 1).

[0098] The preload FIFOs in Fig. 2 may contain the addresses and sizes for already issued
IRAM preloads, exposing them to the XPP cache controller. The FIFOs may have to be
duplicated for every virtual processor in an SMT environment. "Tag" is the typical tag for a
cache line containing starting address, size, and state (empty / clean / dirty / in-use). The
additional in-use state signals usage by the current configuration. The cache controller cannot
manipulate these IRAM instances.

[0099] The execute configuration command may advance all preload FIFOs, copying
the old state to the newly created entry. This way the following preloads may replace the
previously used IRAMs and configurations. If no preload is issued for an IRAM before the
configuration is executed, the preload of the previous configuration may be retained.
Therefore, it may be that it is not necessary to repeat identical preloads for an IRAM in
consecutive configurations.
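
The overwrite-and-retain behavior of the preload FIFOs described above may be illustrated by the following minimal C model. All names (preload_fifo, model_preload, model_execute) and the constants NUM_IRAMS and FIFO_DEPTH are assumptions chosen for the sketch; the model also ignores the case of a full FIFO and does not model the cache controller.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_IRAMS  4   /* assumed number of IRAMs addressed by preloads */
#define FIFO_DEPTH 4   /* assumed FIFO length */

typedef struct { uint32_t start; uint32_t size_words; } preload_desc;

typedef struct {
    preload_desc iram[NUM_IRAMS]; /* one preload descriptor per IRAM */
    int started;                  /* execute command issued for this entry */
} fifo_entry;

typedef struct {
    fifo_entry entry[FIFO_DEPTH];
    unsigned   wr;                /* index of the entry currently being filled */
} preload_fifo;

/* A preload only overwrites the descriptor in the entry being filled, so a
 * later preload to the same IRAM replaces an earlier (speculative) one. */
static void model_preload(preload_fifo *f, int iram,
                          uint32_t start, uint32_t size_words)
{
    f->entry[f->wr].iram[iram] = (preload_desc){ start, size_words };
}

/* Execute marks the current entry as started, advances the write index and
 * copies the old descriptors into the new entry, so that they are retained
 * unless a following preload overwrites them. Returns the started index. */
static unsigned model_execute(preload_fifo *f)
{
    unsigned cur  = f->wr;
    unsigned next = (cur + 1) % FIFO_DEPTH;

    f->entry[cur].started  = 1;
    f->entry[next]         = f->entry[cur];
    f->entry[next].started = 0;
    f->wr = next;
    return cur;
}
```

In this model a configuration that issues no preload for a given IRAM simply inherits the descriptor of the preceding configuration, which corresponds to the retained preload described above.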

[0100] Each configuration's execute command may have to be delayed (stalled) until all
necessary preloads are finished, either explicitly by the use of a synchronization command or
implicitly by the cache controller. Hence the cache controller (XPP Ld/St unit) 125 may
have to handle the synchronization and execute commands as well, actually starting the
configuration as soon as all data is ready. After the termination of the configuration, dirty
IRAMs may be written back to memory as soon as possible if their content is not reused in
the same IRAM. Therefore the XPP PAE array (XPP core 102) and the XPP cache controller
125 can be seen as a single unit since they do not have different instruction streams.
Rather, the cache controller can be seen as the configuration fetch (CF), operand fetch (OF)
(IRAM preload) and write-back (WB) stage of the XPP pipeline, also triggering the execute
stage (EX) (PAE array) (see Fig. 3).

[0101] Due to the long latencies, and their non-predictability (cache misses, variable length
configurations), the stages can be overlapped several configurations wide using the
configuration and data preload FIFO (i.e., pipeline), for loose coupling. If a
configuration is executing and the data for the next has already been preloaded, the data for the
next but one configuration may be preloaded. These preloads can be speculative. The
amount of speculation may be the compiler's trade-off. The reasonable length of the preload
FIFO can be several configurations. It may be limited by diminishing returns, algorithm
properties, the compiler's ability to schedule preloads early and by silicon usage due to the
IRAM duplication factor, which may have to be at least as big as the FIFO length. Due to
this loosely coupled operation, the interlocking (to avoid data hazards between IRAMs)
cannot be done optimally by software (scheduling), but may have to be enforced by
hardware (hardware interlocking). Hence the XPP cache controller and the XPP PAE array can
be seen as separate but not totally independent functional units.

[0102] The XPP cache controller may have several tasks. These are depicted as states in the
diagram shown in Fig. 4. State transitions may take place along the edges between states,
whenever the condition for the edge is true. As soon as the condition is not true any more, the
reverse state transition may take place. The activities for the states may be as follows.

[0103] At the lowest priority, the XPP cache controller 125 may have to fulfill already
issued preload commands, while writing back dirty IRAMs as soon as possible.

[0104] As soon as a configuration finishes, the next configuration can be started. This is a more
urgent task than write-backs or future preloads. To be able to do that, all associated yet
unsatisfied preloads may have to be finished first. Thus, they may be preloaded with the
high priority inherited from the execute state.

[0105] A preload in turn can be blocked by an overlapping in-use or dirty IRAM instance in a
different block or by the lack of empty IRAM instances in the target IRAM block. The former
can be resolved by waiting for the configuration to finish and / or by a write-back. To resolve
the latter, the least recently used clean IRAM can be discarded, thus becoming empty. If no
empty or clean IRAM instance exists, a dirty one may have to be written back to the
memory hierarchy. It cannot occur that no empty, clean, or dirty IRAM instances exist, since
only one instance can be in-use and there should be more than one instance in an IRAM block;
otherwise, no caching effect is achieved.

[0106] In an SMT environment the load FIFOs may have to be replicated for every virtual
processor. The pipelines of the functional units may be fed from the shared fetch / reorder /
issue stage. All functional units may execute in parallel. Different units can execute
instructions of different virtual processors.

[0107] So the following design parameters, with their smallest initial value, may be
obtained:

[0108] * IRAM length: 128 words

[0109] The longer the IRAM length, the longer the running time of the configuration and the less
influence the pipeline startup has.

[0110] * FIFO length: 1

[0111] This parameter may help to hide cache misses during preloading. The longer the FIFO
length, the less disruptive is a series of cache misses for a single configuration.

[0112] * IRAM duplication factor: (pipeline stages + caching factor) * virtual processors: 3

[0113] Pipeline stages is the number of pipeline stages LD/EX/WB plus one for every FIFO stage
above one: 3

[0114] Caching factor is the number of IRAM duplicates available for caching: 0

[0115] Virtual processors is the number of virtual processors with SMT: 1

[0116] The size of the state of a virtual processor is mainly dependent on the FIFO length. It is:

[0117] FIFO length * #IRAM ports * (32 bit (Address) + 32 bit (Size))

[0118] This may have to be replicated for every virtual processor.

[0119] The total size of memory used for the IRAMs may be:

[0120] #IRAM ports * IRAM duplication factor * IRAM length * 32 bit

[0121] A first implementation will probably keep close to the above-stated minimum
parameters, using a FIFO length of one, an IRAM duplication factor of four, an IRAM length of
128 and no simultaneous multithreading.
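
For illustration, the design-parameter formulas above may be evaluated with the following C sketch. The structure xpp_params and the function names are hypothetical. The number of IRAM ports is not given in this passage, so any value used for iram_ports (for example 16) is purely an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the design parameters discussed above. */
typedef struct {
    unsigned iram_length_words;  /* IRAM length in 32-bit words */
    unsigned fifo_length;        /* preload FIFO length */
    unsigned caching_factor;     /* IRAM duplicates available for caching */
    unsigned virtual_processors; /* virtual processors with SMT */
    unsigned iram_ports;         /* "#IRAM ports" (assumed value) */
} xpp_params;

/* Pipeline stages: LD/EX/WB plus one for every FIFO stage above one. */
static unsigned pipeline_stages(const xpp_params *p)
{
    return 3u + (p->fifo_length > 1 ? p->fifo_length - 1 : 0);
}

/* (pipeline stages + caching factor) * virtual processors */
static unsigned duplication_factor(const xpp_params *p)
{
    return (pipeline_stages(p) + p->caching_factor) * p->virtual_processors;
}

/* FIFO length * #IRAM ports * (32 bit address + 32 bit size), in bits */
static uint64_t state_bits_per_virtual_processor(const xpp_params *p)
{
    return (uint64_t)p->fifo_length * p->iram_ports * (32u + 32u);
}

/* #IRAM ports * IRAM duplication factor * IRAM length * 32 bit, in bits */
static uint64_t total_iram_bits(const xpp_params *p)
{
    return (uint64_t)p->iram_ports * duplication_factor(p)
           * p->iram_length_words * 32u;
}
```

With the minimum values listed above (FIFO length 1, caching factor 0, one virtual processor) the duplication factor evaluates to 3; a caching factor of 1 yields the duplication factor of four mentioned for a first implementation.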

2.5.2 Implementation Improvements
Write Pointer

[0122] To further decrease the penalty for unloaded IRAMs, a simple write pointer may be used
per IRAM, which may keep track of the last address already in the IRAM. Thus, no stall
is required, unless an access beyond this write pointer is encountered. This may be especially
useful if all IRAMs have to be reloaded after a task switch. The delay to the configuration
start can be much shorter, especially if the preload engine of the cache controller chooses the
blocking IRAM next whenever several IRAMs need further loading.
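
A minimal sketch of the write-pointer check may look as follows in C. The structure iram_load and the function access_must_stall are hypothetical, and the sketch assumes that the write pointer counts the words already loaded, so that word indices below it are valid.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-IRAM load progress: words [0, write_ptr) are valid. */
typedef struct {
    uint32_t write_ptr;   /* number of words already loaded into the IRAM */
    uint32_t size_words;  /* total number of words requested by the preload */
} iram_load;

/* An access has to stall only if it reaches beyond the write pointer. */
static int access_must_stall(const iram_load *l, uint32_t word_index)
{
    return word_index >= l->write_ptr;
}
```

Because a configuration reading an IRAM in linear order typically trails the loader, such a check would allow execution to begin before the preload has completed.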



Longer FIFOs

[0123] The frequency at the bottom of the memory hierarchy (main memory) cannot be raised
to the same extent as the frequency of the CPU core. To increase the concurrency between the
RISC core 112 and the PACT XPP core 102, the prefetch FIFOs in the above drawing can be
extended. Thus, the IRAM contents for several configurations can be preloaded, like the
configurations themselves. A simple convention makes clear which IRAM preloads belong to
which configuration. The configuration execute switches to the next configuration context.
This can be accomplished by advancing the FIFO write pointer with every configuration
execute, while leaving it unchanged after every preload. Unassigned IRAM FIFO entries may
keep their contents from the previous configuration, so every succeeding configuration may
use the preceding configuration's IRAMx if no different IRAMx was preloaded.



[0124] If none of the memory areas to be copied to IRAMs is in any cache, extending the FIFOs
does not help, as the memory is the bottleneck. So the cache size should be adjusted together
with the FIFO length.

[0125] A drawback of extending the FIFO length is the increased likelihood that the IRAM
content written by an earlier configuration is reused by a later one in another IRAM. A cache
coherence protocol can clear the situation. Note, however, that the situation can be resolved
more easily. If an overlap between any new IRAM area and a currently dirty IRAM contents
of another IRAM bank is detected, the new IRAM is simply not loaded until the write-back of
the changed IRAM has finished. Thus, the execution of the new configuration may be
delayed until the correct data is available.

[0126] For a short (single entry) FIFO, an overlap is extremely unlikely, since the compiler will
usually leave the output IRAM contents of the previous configuration in place for the next
configuration to skip the preload. The compiler may do so using a coalescing algorithm for
the IRAMs / vector registers. The coalescing algorithm may be the same as used for register
coalescing in register allocation.

Read Only IRAMs
[0127] Whenever the memory that is used by the executing configuration is the source of a
preload command for another IRAM, an XPP pipeline stall may occur. The preload can
only be started when the configuration has finished and, if the content was modified, the
memory content has been written to the cache. To decrease the number of pipeline stalls, it
may be beneficial to add an additional read-only IRAM state. If the IRAM is read only, the
content cannot be changed, and the preload of the data to the other IRAM can proceed without
delay. This may require an extension to the preload instructions. The XppPreload and
the XppPreloadClean instruction formats can be combined to a single instruction format that
has two additional bits stating whether the IRAM will be read and/or written. To support
debugging, violations should be checked at the IRAM ports, raising an exception when needed.

2.5.3 Support for Data Distribution and Data Reorganization 

[0128] The IRAMs may be block-oriented structures, which can be read in any order by the
PAE array. However, the address generation may add complexity, reducing the number of
PAEs available for the actual computation. Accordingly, the IRAMs may be
accessed in linear order. The memory hierarchy may be block oriented as well, further
encouraging linear access patterns in the code to avoid cache misses.

[0129] As the IRAM read ports limit the bandwidth between each IRAM and the PAE array to
one word read per cycle, it can be beneficial to distribute the data over several IRAMs to
remove this bottleneck. The top of the memory hierarchy is the source of the data, so the
number of cache misses never increases when the access pattern is changed, as long as
the data locality is not destroyed.

[0130] Many algorithms access memory in linear order by definition to utilize block reading
and simple address calculations. In most other cases and in the cases where loop tiling is
needed to increase the data bandwidth between the IRAMs and the PAE array, the code can be
transformed in a way that data is accessed in optimal order. In many of the remaining cases, the
compiler can modify the access pattern by data layout rearrangements (e.g., array
merging), so that finally the data is accessed in the desired pattern. If none of these
optimizations can be used because of dependencies, or because the data layout is
fixed, there are still two possibilities to improve performance, which are data duplication and
data reordering.

Data Duplication 

[0131] Data i smay be duplicated in several IRAMs. This circumvonts may circumvent the 
IRAM read port bottleneck, allowing several data items to be read from the input every cycle. 

15 

[0132] Several options are possible with a common drawbacks-dat a. Data duplication can only 
be applied to input data : outpu t . Output IRAMs obviously cannot have overlapping address 
ranges. [0133] 

20 * Using several IRAM preload commands specifying just different target IRAMs: [0131] 

This way cache misses may occur only for the first preload. All other preloads will mav take 
place without cache misses — only . Only the time to transfer the data from the top of the 
memory hierarchy to the IRAMs is needed for every additional load. This is only beneficial- if 
the cache misses plus the additional transfer times do not exceed the execution time for the 

25 configuration. [0135] 

* Using an IRAM preload instruction to load multiple IRAMs concurrently:
As identical data is needed in several IRAMs, they can be loaded concurrently by writing the
same values to all of them. This amounts to finding a clean IRAM instance for every target
IRAM, connecting them all to the bus, and writing the data to the bus. The problem with this
instruction may be that it requires a bigger immediate field for the destination (16 bits instead
of 4 for the XPP 64). Accordingly, this instruction format may grow at a higher rate when the
number of IRAMs is increased for bigger XPP arrays.

The interface of this instruction is for example:
XPPPreloadMultiple (int IRAMS, void *StartAddress, int Size)

This instruction may behave as the XPPPreload / XPPPreloadClean instructions with the
exception of the first parameter. The first parameter is IRAMS. This may be an immediate
(constant) value. The value may be a bitmap. For every bit in the bitmap, the IRAM with that
number may be a target for the load operation.

There is no "clean" version, since data duplication is applicable for read data only.
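The bitmap semantics of the first parameter can be illustrated with a small software model in C. This is only a sketch under stated assumptions, not the instruction itself: the names model_preload_multiple, NUM_IRAMS and IRAM_WORDS are hypothetical, each IRAM is modeled as a plain word array, and sixteen IRAMs are assumed so that one bit of a 16-bit immediate selects one IRAM.

```c
#include <stddef.h>
#include <string.h>

#define NUM_IRAMS  16   /* assumed: one bitmap bit per IRAM (16-bit immediate) */
#define IRAM_WORDS 128  /* assumed IRAM capacity in words, for illustration only */

/* Hypothetical software model of the IRAM contents. */
static int iram[NUM_IRAMS][IRAM_WORDS];

/* Model of a multi-target preload: for every bit set in irams_bitmap, the
 * IRAM with that number receives a copy of the same source block.
 * Returns the number of IRAMs loaded, or -1 if the block does not fit. */
static int model_preload_multiple(unsigned irams_bitmap, const int *start, size_t size)
{
    int loaded = 0;
    if (size > IRAM_WORDS)
        return -1;
    for (int n = 0; n < NUM_IRAMS; n++) {
        if (irams_bitmap & (1u << n)) {
            memcpy(iram[n], start, size * sizeof(int));
            loaded++;
        }
    }
    return loaded;
}
```

In this model, a bitmap of 0x0005 would duplicate the block into IRAM 0 and IRAM 2 and leave all other IRAMs untouched.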
Data Reordering 

[0142] Data reordering changes the access pattern to the data only. It does not change the
amount of memory that is read. Thus, the number of cache misses may stay the same.

* Adding additional functionality to the hardware:
> Adding a vector stride to the preload instruction.

A stride (displacement between two elements in memory) may be used in vector load
operations to load, e.g., a column of a matrix into a vector register.

This is a non-sequential but still a linear access pattern. It can be implemented in hardware by
giving a stride to the preload instruction and adding the stride to the IRAM identification state.
One problem with this instruction may be that the number of possible cache misses per IRAM
load rises. In the worst case it can be one cache miss per loaded value, if the stride is equal to
the cache line size and none of the data is in the cache. But as already stated, the total
number of misses stays the same. Just the distribution changes. Still, this is an
undesirable effect.

The other problem may be the complexity of the implementation and a possibly limited
throughput, as the data paths between the layers of the memory hierarchy are optimized for
block transfers. Transferring non-contiguous words will not use wide busses in an optimal
fashion.

The interface of the instruction is for example:
XPPPreloadStride (int IRAM, void *StartAddress, int Size, int Stride)
XPPPreloadCleanStride (int IRAM, void *StartAddress, int Size, int Stride)

This instruction may behave as the XPPPreload / XPPPreloadClean instructions with the
addition of another parameter. The fourth parameter is the vector stride. This may be an
immediate (constant) value. It may tell the cache controller to load only every nth value to the
specified IRAM.
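The effect of the stride parameter can be sketched in C as a strided gather. This is a minimal software model, not the hardware behavior: the function name model_preload_stride is hypothetical, the stride is counted in words, and the IRAM is modeled as an ordinary buffer.

```c
#include <stddef.h>

/* Model of a strided preload: copy `size` words into the IRAM buffer,
 * taking every `stride`-th word of the source, starting at `start`. */
static void model_preload_stride(int *iram_buf, const int *start,
                                 size_t size, size_t stride)
{
    for (size_t k = 0; k < size; k++)
        iram_buf[k] = start[k * stride];
}
```

Under these assumptions, a column of a row-major matrix would be loaded by passing the address of its first element and a stride equal to the number of columns.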

* Reordering the data at run time, introducing temporary copies.
> On the RISC:

The RISC can copy data at a maximum rate of one word per cycle for simple address
computations and at a somewhat lower rate for more complex ones.

With a memory hierarchy, the sources may be read from memory (or cache, if they were
used recently) once and written to the temporary copy, which may then reside in the cache,
too. This may increase the pressure in the memory hierarchy by the amount of memory used
for the temporaries. Since temporaries are allocated on the stack memory, which may be
re-used frequently, the chances are good that the dirty memory area is redefined before it is
written back to memory. Hence the write-back operation to memory is of no concern.

* Via an XPP configuration:

The PAE array can read and write one value from every IRAM per cycle. Thus, if half of the
IRAMs are used as inputs and half of the IRAMs are used as outputs, up to eight (or more,
depending on the number of IRAMs) values can be reordered per cycle, using the PAE array
for address generation. As the inputs and outputs reside in IRAMs, it does not matter if the
reordering is done before or after the configuration that uses the data. The IRAMs can be
reused immediately.

IRAM Chaining 

[0160] If the PAEs do not allow further unrolling, but there are still IRAMs left unused, it may
be possible to load additional blocks of data into these IRAMs and chain two IRAMs via an
address selector. This might not increase throughput as much as unrolling would do, but it
still may help to hide long pipeline startup delays whenever unrolling is not possible.

2.6 Software / Hardware Interface

[0161] According to the design parameter changes and the corresponding changes to the
hardware, according to embodiments of the present invention, the hardware / software interface
has changed. In the following, some prominent changes and their handling are discussed.

2.6.1 Explicit Cache

[0162] The proposed cache is not a usual cache, which would be, without considering
performance issues, invisible to the programmer / compiler, as its operation is transparent.
The proposed cache is an explicit cache. Its state may have to be maintained by software.

Cache Consistency and Pipelining of Preload / Configuration / Write back
[0163] The software may be responsible for cache consistency. It may be possible to have
several IRAMs caching the same or overlapping memory areas. As long as only one of the
IRAMs is written, this is perfectly ok. Only this IRAM will be dirty and will be written back
to memory. If, however, more than one of the IRAMs is written, which data will be written to
memory is not defined. This is a software bug (non-deterministic behavior).

[0164] As the execution of the configuration is overlapped with the preloads and write-backs of
the IRAMs, it may be possible to create preload / configuration sequences that contain data
hazards. As the cache controller and the XPP array can be seen as separate functional units,
which are effectively pipelined, these data hazards are equivalent to pipeline hazards of a
normal instruction pipeline. As with any ordinary pipeline, there are two possibilities to resolve
this, which are hardware interlocking and software interlocking.

* Hardware interlocking:

Interlocking may be done by the cache controller. If the cache controller detects that the tag
of a dirty or in-use item in IRAMx overlaps a memory area used for another IRAM preload, it
may have to stall that preload, effectively serializing the execution of the current
configuration and the preload.

* Software interlocking:

If the cache controller does not enforce interlocking, the code generator may have to insert
explicit synchronize instructions to take care of potential interlocks. Inter-procedural and
inter-modular alias and data dependency analyses can determine if this is the case, while
scheduling algorithms may help to alleviate the impact of the necessary synchronization
instructions.

[0169] In either case, as well as in the case of pipeline stalls due to cache misses, SMT can use
the computation power that would be wasted otherwise.
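Both forms of interlocking rest on the same question, namely whether the memory area of a new preload overlaps an area that is dirty or in use in some IRAM. A minimal C sketch of that test follows. The type iram_state and the functions ranges_overlap and preload_must_stall are hypothetical names introduced only for illustration, areas are assumed to be non-empty byte ranges, and the real tag comparison in a cache controller may look quite different.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical descriptor of the memory area cached by one IRAM. */
typedef struct {
    uintptr_t base;   /* first byte address covered */
    uintptr_t size;   /* length in bytes (assumed non-zero) */
    bool dirty;       /* modified, not yet written back */
    bool in_use;      /* read by the running configuration */
} iram_state;

/* True if the half-open ranges [a, a+alen) and [b, b+blen) intersect. */
static bool ranges_overlap(uintptr_t a, uintptr_t alen, uintptr_t b, uintptr_t blen)
{
    return a < b + blen && b < a + alen;
}

/* A preload of [base, base+size) has to wait (hardware stall or explicit
 * synchronization) if it touches a dirty or in-use IRAM area. */
static bool preload_must_stall(const iram_state *irams, int count,
                               uintptr_t base, uintptr_t size)
{
    for (int n = 0; n < count; n++) {
        if ((irams[n].dirty || irams[n].in_use) &&
            ranges_overlap(irams[n].base, irams[n].size, base, size))
            return true;
    }
    return false;
}
```

With hardware interlocking the controller would evaluate such a test at run time. With software interlocking the code generator would answer it conservatively at compile time and emit a synchronization instruction whenever an overlap cannot be ruled out.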

Code Generation for the Explicit Cache 
[0170] Apart from the explicit synchronization instructions issued with software interlocking,
the following instructions may have to be issued by the compiler.

* Configuration preload instructions, preceding the IRAM preload instructions, that will be used
by that configuration. These should be scheduled as early as possible by the instruction
scheduler.

* IRAM preload instructions, which should also be scheduled as early as possible by the
instruction scheduler.

* Configuration execute instructions, following the IRAM preload instructions for that
configuration. These instructions should be scheduled between the estimated minimum and the
estimated maximum of the cumulative latency of their preload instructions.

* IRAM synchronization instructions, which should be scheduled as late as possible by the
instruction scheduler. These instructions must be inserted before any potential access of the
RISC to the data areas that are duplicated and potentially modified in the IRAMs. Typically,
these instructions will follow a long chain of computations on the XPP, so they will not
significantly decrease performance.

Asynchronicity to Other Functional Units
[0175] An XppSync() must be issued by the compiler if an instruction of another functional
unit (mainly the Ld/St unit) can access a memory area that is potentially dirty or in-use in an
IRAM. This may force a synchronization of the instruction streams and the cache contents,
avoiding data hazards. A thorough inter-procedural and inter-modular array alias analysis
may limit the frequency of these synchronization instructions to an acceptable level.

2.7 Another Implementation

[0176] For the previous design, the IRAMs are existent in silicon, duplicated several times to
keep the pipeline busy. This may amount to a large silicon area that is not fully busy all the
time, especially when the PAE array is not used, but as well whenever the configuration
does not use all of the IRAMs present in the array. The duplication may also make it
difficult to extend the lengths of the IRAMs, as the total size of the already large IRAM area
scales linearly.

[0177] For a more silicon-efficient implementation, the IRAMs may be integrated into the first
level cache, making this cache bigger. This means that the first level cache controller is
extended to feed all IRAM ports of the PAE array. This way the XPP and the RISC may share
the first level cache in a more efficient manner. Whenever the XPP is executing, it may steal
as much cache space as it needs from the RISC. Whenever the RISC alone is running, it will
have plenty of additional cache space to improve performance.

[0178] The PAE array may have the ability to read one word and write one word to each
IRAM port every cycle. This can be limited to either a read or a write access per cycle, without
limiting programmability. If data has to be written to the same area in the same cycle, another
IRAM port can be used. This may increase the number of used IRAM ports, but only
under rare circumstances.

[0179] This leaves sixteen data accesses per PAE cycle in the worst case. Due to the worst case
of all sixteen memory areas for the sixteen IRAM ports mapping to the same associative bank,
the minimum associativity for the cache may be a 16-way set associativity. This may
avoid cache replacement for this rare, but possible, worst-case example.






MARKED-UP VERSION OF THE 
SUBSTITUTE SPECIFICATION 



[0180] Two factors may help to support sixteen accesses per PAE array cycle:

* The clock frequency of the PAE array generally has to be lower than for the RISC by a factor
of two to four. The reasons lie in the configurable routing channels with switch matrices which
cannot support as high a frequency as solid point-to-point aluminum or copper traces.

This means that two to four IRAM port accesses can be handled serially by a single cache port,
as long as all reads are serviced before all writes, if there is a potential overlap. This can be
accomplished by assuming a potential overlap and enforcing a priority ordering of all accesses,
giving the read accesses higher priority.

* A factor of two, four, or eight is possible by accessing the cache as two, four, or eight banks
of lower associativity cache.

For a cycle divisor of four, four banks of four-way associativity will be optimal. During four
successive cycles, four different accesses can be served by each bank of four-way associativity.
Up to four-way data duplication can be handled by using adjacent IRAM ports that are
connected to the same bus (bank). For further data duplication, the data may have to be
duplicated explicitly, using an XppPreloadMultiple() cache controller instruction. The
maximum data duplication for sixteen read accesses to the same memory area is supported by
an actual data duplication factor of four, one copy in each bank. This does not affect the cache
RAM efficiency as adversely as an actual data duplication of 16 for the embodiment discussed
above under the heading "A Load Store Architecture."

[0185] The cache controller may run at the same speed as the RISC. The XPP may run at a
lower (e.g., quarter) speed. Accordingly, in the worst case, sixteen read requests from the PAE
array may be serviced in four cycles of the cache controller, with an additional four read
requests from the RISC. Accordingly, one bus at full speed can be used to service four IRAM
read ports. Using four-way associativity, four accesses per cycle can be serviced, even in the
case that all four accesses go to addresses that map to the same associative block.

[0186] a) The RISC still has a 16-way set associative view of the cache, accessing all four
four-way set associative banks in parallel. Due to data duplication, it is possible that several
banks return a hit. This may be taken care of with a priority encoder, enabling only one bank
onto the data bus.

[0187] b) The RISC is blocked from the banks that service IRAM port accesses. Wait states are
inserted accordingly. The impact of wait states is reduced if the RISC shares the second cache
access port of a two-port cache with the RAM interface, using the cycles between the RAM
transfers for its accesses.

[0188] A problem is that a read could potentially address the same memory location as a write
from another IRAM. The value read may depend on the order of the operations, so the order
is fixed, i.e., all writes have to take place after all reads, but before the reads of the next
cycle, except if the reads and writes actually do not overlap. However, a simple priority
scheme for the bus accesses enforces the correct ordering of the accesses.

[0189] The problem of read-write consistency is more severe with data duplication, when only
one copy of the data is actually modified. Therefore, modifications are forbidden with data
duplication.

2.7.1 Programming Model Changes
Data Interference

[0190] According to an example embodiment of the present invention that is without dedicated
IRAMs, it is not possible anymore to load input data to the IRAMs and write the output data to
a different IRAM, which is mapped to the same address, thus operating on the original,
unaltered input data during the whole configuration.

[0191] As there are no dedicated IRAMs anymore, writes directly modify the cache
contents, which will be read by succeeding reads. This changes the programming model
significantly. Additional and more in-depth compiler analyses are accordingly necessary.

2.7.2 Hiding Implementation Details 

[0192] The actual number of bits in the destination field of the XppPreloadMultiple instruction
is implementation dependent. It depends on the number of cache banks and their associativity,
which are determined by the clock frequency divisor of the XPP PAE array relative to the cache
frequency. However, this can be hidden by the assembler, which may translate IRAM ports to
cache banks, thus reducing the number of bits from the number of IRAM ports to the number
of banks. For the user, it is sufficient to know that each cache bank services an adjacent set of
IRAM ports starting at a power of two. Thus, it may be best to use data duplication for
adjacent ports, starting with the highest power of two greater than the number of read ports to
the duplicated area.
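The translation from IRAM ports to cache banks that an assembler might perform can be sketched in C as follows. This is an illustration under assumptions, not the actual assembler: the name ports_to_banks is hypothetical, sixteen IRAM ports are assumed, and each bank is assumed to serve four adjacent ports (ports 0 to 3 on bank 0, ports 4 to 7 on bank 1, and so on), which corresponds to a cycle divisor of four.

```c
#define NUM_PORTS      16  /* assumed number of IRAM ports */
#define PORTS_PER_BANK 4   /* assumed: four adjacent IRAM ports per cache bank */

/* Translate a bitmap of IRAM ports into a bitmap of cache banks.
 * A bank bit is set if at least one of its ports is selected. */
static unsigned ports_to_banks(unsigned port_bitmap)
{
    unsigned banks = 0;
    for (unsigned p = 0; p < NUM_PORTS; p++)
        if (port_bitmap & (1u << p))
            banks |= 1u << (p / PORTS_PER_BANK);
    return banks;
}
```

Under these assumptions a 16-bit port bitmap shrinks to a 4-bit bank field, and duplication across the four adjacent ports of one bank needs only a single copy of the data.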

3 PROGRAM OPTIMIZATIONS
3.1 Code Analysis

[0193] Analyses may be performed on programs to describe the relationships between data and
memory location in a program. These analyses may then be used by different optimizations.
More details regarding the analyses are discussed in Michael Wolfe, "High Performance
Compilers for Parallel Computing" (Addison-Wesley 1996); Hans Zima & Barbara Chapman,
"Supercompilers for parallel and vector computers" (Addison-Wesley 1991); and Steven
Muchnick, "Advanced Compiler Design and Implementation" (Morgan Kaufmann 1997).

3.1.1 Data-Flow Analysis

[0194] Data-flow analysis examines the flow of scalar values through a program to
provide information about how the program manipulates its data. This information can be
represented by data-flow equations that have the following general form for object i, that can
be an instruction or a basic block, depending on the problem to solve:

$Ex[i] = Gen[i] \cup (In[i] - Kill[i])$

[0195] This means that data available at the end of the execution of object i, Ex[i], are either
produced by i, Gen[i], or were alive at the beginning of i, In[i], but were not deleted during the
execution of i, Kill[i].
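The general equation maps directly onto bit-set operations, which is how such sets are commonly represented in practice. The following C sketch assumes an encoding that is not part of the text above: each definition is given one bit of a 32-bit word, and the names defset and flow_out are hypothetical.

```c
#include <stdint.h>

/* Set of up to 32 definitions, one bit per definition (assumed encoding). */
typedef uint32_t defset;

/* Ex[i] = Gen[i] union (In[i] minus Kill[i]), evaluated on bit sets. */
static defset flow_out(defset gen, defset in, defset kill)
{
    return gen | (in & ~kill);
}
```

For a whole control-flow graph, In[i] would be combined from the Ex sets of the predecessors of i, and the equations would be re-evaluated until no set changes.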

[0196] These equations can be used to solve several problems, such as, e.g.,
* the problem of reaching definitions;
* the Def-Use and Use-Def chains, describing, for a definition, all uses that can be reached
from it, and, for a use, all definitions that can reach it, respectively;
* the available expressions at a point in the program; and/or
* the live variables at a point in the program,
whose solutions are then used by several compilation phases, analysis, or optimizations.

[0201] For example, with respect to a problem of computing the Def-Use chains of the variables
of a program, this information can be used for instance by the data dependence analysis for
scalar variables or by the register allocation. A Def-Use chain is associated to each definition
of a variable and is the set of all visible uses from this definition. The data-flow equations
presented above may be applied to the basic blocks to detect the variables that are passed from
one block to another along the control-flow graph. In the figure below, two definitions for
variable x are produced: S1 in B1 and S4 in B3. Hence, the variable that can be found at the
exit of B1 is Ex(B1) = {x(S1)} and at the exit of B4 is Ex(B4) = {x(S4)}. Moreover,
Ex(B2) = Ex(B1) as no variable is defined in B2. Using these sets, it is the case that the uses
of x in S2 and S3 depend on the definition of x in B1 and that the use of x in S5 depends on
the definitions of x in B1 and B3. The Def-Use chains associated with the definitions are then
D(S1) = {S2, S3, S5} and D(S4) = {S5}.

[0202] The control-flow graph of a piece of program is shown in Fig. 7.

3.1.2 Data Dependence Analysis 

[0203] A data dependence graph represents the dependencies existing between
operations writing or reading the same data. This graph may be used for optimizations like
scheduling, or certain loop optimizations to test their semantic validity. The nodes of the graph
represent the instructions, and the edges represent the data dependencies. These
dependencies can be of three types: true (or flow) dependence when a variable is
written before being read, anti-dependence when a variable is read before being written, and
output dependence when a variable is written twice. A more formal definition is
provided in Hans Zima et al., supra, and is presented below.

Definition

[0204] Let S and S' be two statements. Then S' depends on S, noted $S \,\delta\, S'$, iff:
[0205] (1) S is executed before S'
[0206] (2) $\exists v \in VAR : v \in DEF(S) \cap USE(S') \lor v \in USE(S) \cap DEF(S') \lor v \in DEF(S) \cap DEF(S')$
[0207] (3) There is no statement T such that S is executed before T and T is executed before S',
and $v \in DEF(T)$

[0208] where VAR is the set of the variables of the program, DEF(S) is the set of the
variables defined by instruction S, and USE(S) is the set of variables used by instruction S.

[0209] Moreover, if the statements are in a loop, a dependence can be loop-independent or
loop-carried. This notion introduces the definition of the distance of a dependence. When a
dependence is loop-independent, it occurs between two instances of different statements in
the same iteration, and its distance is equal to 0. By contrast, when a dependence is
loop-carried, it occurs between two instances in two different iterations, and its distance is
equal to the difference between the iteration numbers of the two instances.

[0210] The notion of direction of dependence generalizes the notion of distance, and is
generally used when the distance of a dependence is not constant, or cannot be computed with
precision. The direction of a dependence is given by < if the dependence between S and S'
occurs when the instance of S is in an iteration before the iteration of the instance of S', = if
the two instances are in the same iteration, and > if the instance of S is in an iteration after the
iteration of the instance of S'.

[0211] In the case of a loop nest, there are distance and direction vectors, with one
element for each level of the loop nest. The examples below illustrate all these definitions. The
data dependence graph may be used by a lot of optimizations, and may also be useful to
determine if their application is valid. For instance, a loop can be vectorized if its data
dependence graph does not contain any cycle.
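When the distance of a dependence is known, its direction follows from the sign of the distance. The C sketch below makes this explicit for one loop level. The function name dependence_direction is hypothetical, and the distance is assumed to be the iteration number of the instance of S' minus the iteration number of the instance of S.

```c
/* Direction symbol for one loop level, derived from the dependence distance
 * (iteration of the instance of S' minus iteration of the instance of S). */
static char dependence_direction(long distance)
{
    if (distance > 0) return '<';   /* S runs in an earlier iteration than S' */
    if (distance < 0) return '>';   /* S runs in a later iteration than S' */
    return '=';                     /* same iteration: loop-independent */
}
```

Applied element by element to a distance vector, this would yield the corresponding direction vector, for instance (=,<) for a distance vector (0,2).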

[0212] Example of a true dependence with distance 0 on array a:
for (i=0; i<N; i=i+1) {
S:  a[i] = b[i] + 1;
S1: c[i] = a[i] + 2;
}

[0213] Example of an anti-dependence with distance 0 on array b:
for (i=0; i<N; i=i+1) {
S:  a[i] = b[i] + 1;
S1: b[i] = c[i] + 2;
}

[0214] Example of an output dependence with distance 0 on array a:
for (i=0; i<N; i=i+1) {
S:  a[i] = b[i] + 1;
S1: a[i] = c[i] + 2;
}



[0215] Example of a dependence with direction vector (=,=) between S1 and S2 and a
dependence with direction vector (=,=,<) between S2 and S2:
for (j=0; j<=N; j++)
  for (i=0; i<=N; i++) {
S1:   c[i][j] = 0;
      for (k=0; k<=N; k++)
S2:     c[i][j] = c[i][j] + a[i][k]*b[k][j];
  }

[0216] Example of an anti-dependence with distance vector (0,2):
for (i=0; i<=N; i++)
  for (j=0; j<=N; j++)
S:  a[i][j] = a[i][j+2] + b[i];

3.1.3 Interprocedural Alias Analysis 

[0217] An aim of alias analysis is to determine if a memory location is aliased by several
objects, e.g., variables or arrays, in a program. It may have a strong impact on data
dependence analysis and on the application of code optimizations. Aliases can occur with
statically allocated data, like unions in C where all fields refer to the same memory area, with
dynamically allocated data, which are the usual targets of the analysis, or with pointers
referencing static data, like in C. A typical case of aliasing, where p aliases b, is:
int b[100], *p;
for (p=b; p < &b[100]; p++)
  *p=0;

[0223] Alias analysis can be more or less precise depending on whether or not it takes the
control-flow into account. When it does, it is called flow-sensitive, and when it does not, it is
called flow-insensitive. Flow-sensitive alias analysis is able to detect in which blocks along a
path two objects are aliased. As it is more precise, it is more complicated and more expensive
to compute. Usually flow-insensitive alias information is sufficient. This aspect is illustrated
in Fig. 8 where a flow-insensitive analysis would find that p alias b, but where a
flow-sensitive analysis would be able to find that p alias b only in block B2.

[0224] Furthermore, aliases are classified into must-aliases and may-aliases. For instance,
considering flow-insensitive may-alias information, x alias y, iff x and y may, possibly at
different times, refer to the same memory location. Considering flow-insensitive must-alias
information, x alias y, iff x and y must, throughout the execution of a procedure, refer to the
same storage location. In the case of Fig. 8, if flow-insensitive may-alias information is
considered, p alias b holds, whereas if flow-insensitive must-alias information is considered,
p alias b does not hold. The kind of information to use depends on the problem to solve. For
instance, if removal of redundant expressions or statements is desired, must-aliases must be
used, whereas if building a data dependence graph is desired, may-aliases are necessary.

[0225] Practically, as exact alias information is hard to compute, the analysis is rather used to
be sure that two objects are not aliased. Finally, this analysis must be interprocedural to be
able to detect aliases caused by non-local variables and parameter passing. The latter case is
depicted in the code below, which is an example for aliasing by parameter passing, where i
and j are aliased through the function call where k is passed twice as parameter.

[0226] Example for aliasing by parameter passing:
void foo (int *i, int *j)
{
  *i = *j+1;
}

foo (&k, &k);
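The practical consequence of such aliasing can be shown with a slightly extended variant of the function above. The function write_then_read is a hypothetical illustration, not part of the original example: it performs the same store and then reads *j again, so its result depends on whether the two parameters refer to the same location.

```c
/* Hypothetical variant of foo: same store, followed by a second read of *j. */
static int write_then_read(int *i, int *j)
{
    *i = *j + 1;
    /* If i and j alias, the store above has changed *j, so this read must
     * not be replaced by the value loaded before the store. */
    return *j;
}
```

Called as write_then_read(&k, &k) with k equal to 5, the function returns 6. Called with two distinct variables holding 0 and 5, it returns 5. Without interprocedural alias information, a compiler has to assume that either case can occur.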

3.1.4 Interprocedural Value Range Analysis

[0227] This analysis can find the range of values taken by the variables. It can help to apply
optimizations like dead code elimination, loop unrolling and others. For this purpose, it can use
information on the types of variables and then consider operations applied on these variables
during the execution of the program. Thus, it can determine, for instance, if tests in conditional
instructions are likely to be met or not, or determine the iteration range of loop nests.

[0228] This analysis has to be interprocedural as, for instance, loop bounds can be passed as
parameters of a function, as in the following example. It is known by analyzing the code that
in the loop executed with array 'a', N is at least equal to 11, and that in the loop executed
with array 'b', N is at most equal to 10.

void foo (int *c, int N)
{
  int i;
  for (i=0; i<N; i++)
    c[i] = g(i,2);
}

if (N > 10)
  foo (a,N);
else
  foo (b,N);

[0229] Value range analysis can be supported by the programmer by giving further value
constraints which cannot be retrieved from the language semantics. This can be done by
pragmas or by a compiler-known assert function.
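As an illustration of such a programmer-supplied constraint, the following C sketch states a bound on a loop count with the standard assert macro. The function sum_first and the bound of 16 are assumptions made only for this example. Whether a given compiler actually exploits an assert as a value range hint is compiler-specific, and the standard macro is disabled when NDEBUG is defined.

```c
#include <assert.h>

/* Sums the first n elements of v. The assert states a value range
 * (assumed here: 1..16) that cannot be derived from the language alone. */
static int sum_first(const int *v, int n)
{
    assert(n >= 1 && n <= 16);   /* value constraint supplied by the programmer */
    int s = 0;
    for (int i = 0; i < n; i++)
        s += v[i];
    return s;
}
```

With the range of n known, an analysis could, for example, bound the iteration count of the loop when deciding on unrolling.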

3.1.5 Alignment Analysis
[0230] Alignment analysis deals with data layout for distributed memory architectures. As
stated by Saman Amarasinghe, "Although data memory is logically a linear array of cells, its
realization in hardware can be viewed as a multi-dimensional array. Given a dimension in this
array, alignment analysis will identify memory locations that always resolve to a single value in
that dimension. For example, if the dimension of interest is memory banks, alignment analysis
will identify if a memory reference always accesses the same bank." This is the case in the
second part of Fig. 9, which is a reproduction of a figure that can be found in Sam Larsen,
Emmet Witchel & Saman Amarasinghe, "Increasing and Detecting Memory Address
Congruence," Proceedings of the 2002 IEEE International Conference on Parallel
Architectures and Compilation Techniques (PACT02), 18-29 (September 2002). All accesses,
depicted in dark squares, occur to the same memory bank, whereas in the first part, the
accesses are not aligned. Saman Amarasinghe adds that "Alignment information is useful in a
variety of compiler-controlled memory optimizations leading to improvements in
programmability, performance, and energy consumption."

[0231] Alignment analysis, for instance, is able to help find a good distribution scheme of the
data and is furthermore useful for automatic data distribution tools. An automatic alignment
analysis tool can automatically generate alignment proposals for the arrays accessed in a
procedure and thus simplify the data distribution problem. This can be extended with an
interprocedural analysis taking into account dynamic realignment.

[0232] Alignment analysis can also be used to apply loop alignment that transforms the code
directly rather than the data layout in itself, as discussed below. Another solution can be used
for the PACT XPP, relying on the fact that it can handle aligned code very efficiently. It
includes adding a conditional instruction testing if the accesses in the loop body are aligned,
followed by the necessary number of peeled iterations of the loop body, then the aligned loop
body, and then some compensation code. Only the aligned code is then executed by the PACT
XPP. The rest may be executed by the host processor. If the alignment analysis is more precise
(inter-procedural or inter-modular), less conditional code has to be inserted.
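The structure described above (alignment test, peeled iterations, aligned loop body, compensation code) can be sketched in C. This is only a host-side illustration under assumptions: the function add_one_peeled is hypothetical, a 16-byte alignment boundary is assumed, and the comments mark which part would be a candidate for the array and which parts would stay on the host processor.

```c
#include <stddef.h>
#include <stdint.h>

#define ALIGN_BYTES 16u   /* assumed alignment boundary */

/* Adds 1 to every element of a[0..n-1] using a peeled head, an aligned body
 * and a compensation tail. Returns the number of elements handled by the
 * aligned body. */
static size_t add_one_peeled(int *a, size_t n)
{
    size_t i = 0, aligned = 0;
    const size_t step = ALIGN_BYTES / sizeof(int);

    /* Peeled iterations until &a[i] is aligned (would run on the host). */
    while (i < n && ((uintptr_t)&a[i] % ALIGN_BYTES) != 0)
        a[i++] += 1;

    /* Aligned loop body in whole blocks (candidate for the array). */
    for (; i + step <= n; i += step)
        for (size_t k = 0; k < step; k++, aligned++)
            a[i + k] += 1;

    /* Compensation code for the remaining elements (host again). */
    for (; i < n; i++)
        a[i] += 1;

    return aligned;
}
```

If a more precise alignment analysis can prove at compile time that the array start is aligned, the run-time test and the peeled loop become unnecessary.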

3r2-Code Optimizations 
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|"02331 Mos t Discussion regarding many of the optimizations and transformations prosontod 
hef ediscussed below can be found in detail in HI, and also in T2,3,51. David F. Bacon, Susan L. 
Graham & Oliver J. Sharp, "Compiler Transformations for High-Performance Computing," 
ACM Computing Surveys, 26(4):325-420 (1994); Michael Wolfe, supra; Hans Zima et al., 
5 supra; and Steven Muchnick. supra. 

3.2.1 General Transformations

[0234] Discussed below are a few general optimizations that can be applied to straightforward
code and to loop bodies. These are not the only ones that appear in a compiler, but they are
mentioned in the sequel of this document.

Constant Propagation

[0235] A constant propagation may propagate the values of constants into the expressions using
them throughout the program. This way a lot of computations can be done statically by the
compiler, leaving less work to be done during the execution. This part of the optimization is
also known as constant folding.

[0236] An example of constant propagation is:

N = 256;                            for (i=0; i<=256; i++)
c = 3;                                  a[i] = b[i] + 3;
for (i=0; i<=N; i++)
    a[i] = b[i] + c;
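As a rough illustration, the two forms can be written as C functions and compared. This is only a sketch: the function names and the fixed array length of 257 elements are assumptions made for the example and are not part of the original text.

```c
/* Sketch only: names and the array length are illustrative assumptions. */
enum { LEN = 257 };

/* Before: N and c are variables that happen to hold known constants. */
void add_const_before(int a[LEN], const int b[LEN]) {
    int N = 256;
    int c = 3;
    for (int i = 0; i <= N; i++)
        a[i] = b[i] + c;
}

/* After: the constants are propagated into the bound and the expression. */
void add_const_after(int a[LEN], const int b[LEN]) {
    for (int i = 0; i <= 256; i++)
        a[i] = b[i] + 3;
}
```

Both functions are expected to write the same values; the second simply leaves nothing for the program to look up at run time.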



Copy Propagation

[0237] A copy propagation optimization may simplify the code by removing redundant copies of
the same variable in the code. These copies can be produced by the programmer or by other
optimizations. This optimization may reduce the register pressure and the number of
register-to-register move instructions.

[0238] An example of copy propagation is:

t = i*4;                            t = i*4;
r = t;                              for (i=0; i<=N; i++)
for (i=0; i<=N; i++)                    a[t] = b[t] + a[i];
    a[r] = b[r] + a[i];



Dead Code Elimination 

[0239] A dead code elimination optimization may remove pieces of code that will never be
executed. Code is never executed if it is in the branch of a conditional statement whose
condition is always evaluated to true or false, or if it is a loop body, whose number of
iterations is always equal to zero. The latter implies that this optimization relies also on value

[0240] Code updating variables that are never used is also useless and can be removed as well.
If a variable is never used, then the code updating it and its declaration can also be
eliminated.

[0241] An example of dead code elimination is:

for (i=0; i<=N; i++) {              for (i=0; i<=N; i++) {
    if (i>N)                            for (j=0; j<10; j++)
        for (j=0; j<10; j++)                a[j+1] = a[j] + b[j];
            a[j] = b[j] + a[i];     }
    else
        for (j=0; j<10; j++)
            a[j+1] = a[j] + b[j];
}



Forward Substitution

[0242] A forward substitution optimization is a generalization of copy propagation. The use of
a variable may be replaced by its defining expression. It can be used for simplifying the data
dependency analysis and the application of other transformations by making the use of loop
variables visible.

[0243] An example of forward substitution is:

c = N + 1;                          for (i=0; i<=N; i++)
for (i=0; i<=N; i++)                    a[N+1] = b[N+1] + a[i];
    a[c] = b[c] + a[i];



Idiom Recognition 

[0244] An idiom recognition transformation may recognize pieces of code and can replace them by
calls to compiler known functions, or less expensive code sequences, like code for absolute
value computation.

[0245] An example of idiom recognition is:

for (i=0; i<N; i++) {               for (i=0; i<N; i++) {
    c = a[i] - b[i];                    c = a[i] - b[i];
    if (c<0)                            c = abs(c);
        c = -c;                         d[i] = c;
    d[i] = c;                       }
}

3.2.2 Loop Transformations 

Loop Normalization 

[0246] A loop normalization transformation may ensure that the iteration space of the loop is
always with a lower bound equal to 0 or 1 (depending on the input language), and an increment
with a step of 1. The array subscript expressions and the bounds of the loops are modified
accordingly. It can be used before loop fusion to find opportunities, and ease inter-loop
dependence analysis, and it also enables the use of dependence tests that need a normalized
loop to be applied.

[0247] An example of loop normalization is:

for (i=2; i<N; i=i+2)               for (i=0; i<(N-1)/2; i++)
    a[i] = b[i];                        a[2*i+2] = b[2*i+2];
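A minimal sketch of the same normalization in C follows, with hypothetical function names. The trip count is computed here as (n - 1) / 2 so that odd upper bounds are covered as well; that choice is an assumption of the sketch.

```c
/* Original loop: lower bound 2, step 2. */
void copy_even_original(int *a, const int *b, int n) {
    for (int i = 2; i < n; i = i + 2)
        a[i] = b[i];
}

/* Normalized loop: lower bound 0, step 1, subscripts rewritten. */
void copy_even_normalized(int *a, const int *b, int n) {
    /* Number of iterations of the original loop: ceil((n - 2) / 2). */
    int trips = (n > 2) ? (n - 1) / 2 : 0;
    for (int i = 0; i < trips; i++)
        a[2 * i + 2] = b[2 * i + 2];
}
```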

Loop Reversal

[0248] A loop reversal transformation may change the direction in which the iteration space of
a loop is scanned. It is usually used in conjunction with loop normalization and other
transformations, like loop interchange, because it changes the dependence vectors.

[0249] An example of loop reversal is:

for (i=N; i>=0; i--)                for (i=0; i<=N; i++)
    a[i] = b[i];                        a[i] = b[i];



Strength Reduction

[0250] A strength reduction transformation may replace expressions in the loop body by
equivalent but less expensive ones. It can be used on induction variables, other than the loop
variable, to be able to eliminate them.

[0251] An example of strength reduction is:

for (i=0; i<N; i++)                 t = 0;
    a[i] = b[i] + c*i;              for (i=0; i<N; i++) {
                                        a[i] = b[i] + t;
                                        t = t + c;
                                    }
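The replacement of the multiplication by a running sum can be sketched in C as follows. The function names are hypothetical, and the accumulator starts at 0 so that both forms agree at i = 0.

```c
/* Original: one multiplication per iteration. */
void scale_original(int *a, const int *b, int c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c * i;
}

/* Strength-reduced: the product c * i is maintained by repeated addition. */
void scale_reduced(int *a, const int *b, int c, int n) {
    int t = 0;                  /* equals c * i at the top of each iteration */
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + t;
        t = t + c;
    }
}
```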



Induction Variable Elimination 
[0252] An induction variable elimination transformation can use strength reduction to remove
induction variables from a loop, hence reducing the number of computations and easing the
analysis of the loop. This may also remove dependence cycles due to the update of the variable,
enabling vectorization.

[0253] An example of induction variable elimination is:

for (i=0; i<=N; i++) {              for (i=0; i<=N; i++) {
    k = k + 3;                          a[i] = b[i] + a[k+(i+1)*3];
    a[i] = b[i] + a[k];             }
}                                   k = k + (N+1)*3;
Loop-Invariant Code Motion

[0254] A loop-invariant code motion transformation may move computations outside a loop if
their result is the same in all iterations. This may allow a reduction of the number of
computations in the loop body. This optimization can also be conducted in the reverse fashion
in order to get perfectly nested loops, which are easier to handle by other optimizations.

[0255] An example of loop-invariant code motion is:

for (i=0; i<N; i++)                 if (N >= 0)
    a[i] = b[i] + x*y;                  c = x*y;
                                    for (i=0; i<N; i++)
                                        a[i] = b[i] + c;



Loop Unswitching

[0256] A loop unswitching transformation may move a conditional instruction outside of a loop
body if its condition is loop-invariant. The branches of the condition may then be made of the
original loop with the appropriate original statements of the conditional statement. It may
allow further parallelization of the loop by removing control-flow code in the loop body and
also removing unnecessary computations from it.

[0257] An example of loop unswitching is:

for (i=0; i<N; i++) {               if (x > 2)
    a[i] = b[i] + 3;                    for (i=0; i<N; i++) {
    if (x > 2)                              a[i] = b[i] + 3;
        b[i] = c[i] + 2;                    b[i] = c[i] + 2;
    else                                }
        b[i] = c[i] - 2;            else
}                                       for (i=0; i<N; i++) {
                                            a[i] = b[i] + 3;
                                            b[i] = c[i] - 2;
                                        }
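A small C sketch of the same unswitching is given below, with hypothetical function names. The loop-invariant test on x is evaluated once in the second form, and each branch contains a copy of the loop without control flow in its body.

```c
/* Original: the invariant condition is tested in every iteration. */
void update_original(int *a, int *b, const int *c, int x, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + 3;
        if (x > 2)
            b[i] = c[i] + 2;
        else
            b[i] = c[i] - 2;
    }
}

/* Unswitched: the condition is tested once, outside the loops. */
void update_unswitched(int *a, int *b, const int *c, int x, int n) {
    if (x > 2) {
        for (int i = 0; i < n; i++) {
            a[i] = b[i] + 3;
            b[i] = c[i] + 2;
        }
    } else {
        for (int i = 0; i < n; i++) {
            a[i] = b[i] + 3;
            b[i] = c[i] - 2;
        }
    }
}
```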



If-Conversion

[0258] An if-conversion transformation may be applied on loop bodies with conditional
instructions. It may change control dependencies into data dependencies and then allow
vectorization to take place. It can be used in conjunction with loop unswitching to handle loop
bodies with several basic blocks. The conditions where array expressions could appear may be
replaced by boolean terms called guards. Processors with predicated execution support can
directly execute such code, and configurable hardware can use the result of guards to direct
dataflow through different branches by means of multiplexers and demultiplexers.

[0259] An example of if-conversion is:

for (i=0; i<N; i++) {               for (i=0; i<N; i++) {
    a[i] = a[i] + b[i];                 a[i] = a[i] + b[i];
    if (a[i] != 0)                      c2 = (a[i] != 0);
        if (a[i] > c[i])                if (c2) c4 = (a[i] > c[i]);
            a[i] = a[i] - 2;            if (c2 && c4) a[i] = a[i] - 2;
        else                            if (c2 && !c4) a[i] = a[i] + 1;
            a[i] = a[i] + 1;            d[i] = a[i] * 2;
    d[i] = a[i] * 2;                }
}
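The guard idea can be sketched in C as below. The function names are hypothetical, and the guarded version expresses the two guards as 0/1 integers that select the update arithmetically, which is one possible way (an assumption of this sketch) to mimic predicated or multiplexed execution.

```c
/* Branching form: nested conditionals inside the loop body. */
void body_branching(int *a, const int *b, const int *c, int *d, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        if (a[i] != 0) {
            if (a[i] > c[i])
                a[i] = a[i] - 2;
            else
                a[i] = a[i] + 1;
        }
        d[i] = a[i] * 2;
    }
}

/* Guarded form: conditions become data (guards), not control flow. */
void body_guarded(int *a, const int *b, const int *c, int *d, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        int c2 = (a[i] != 0);
        int c4 = (a[i] > c[i]);
        int g_then = c2 & c4;       /* guard of "a[i] - 2" */
        int g_else = c2 & !c4;      /* guard of "a[i] + 1" */
        a[i] = a[i] - 2 * g_then + g_else;
        d[i] = a[i] * 2;
    }
}
```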



Strip-Mining 
[0260] A strip-mining transformation may enable adjustment of the granularity of an operation.
It is commonly used to choose the number of independent computations in the inner loop nest.
When the iteration count is not known at compile time, it can be used to generate a fixed
iteration count inner loop satisfying the resource constraints. It can be used in conjunction
with other transformations like loop distribution or loop interchange. It is also called loop
sectioning. Cycle shrinking, also called stripping, is a specialization of strip-mining.

[0261] An example of strip-mining is:

for (i=0; i<N; i++)                 up = (N/16)*16;
    a[i] = b[i] + c;                for (i=0; i<up; i = i + 16)
                                        for (j=i; j<i+16; j++)
                                            a[j] = b[j] + c;
                                    for (j=up; j<N; j++)
                                        a[j] = b[j] + c;
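As a sketch, strip-mining with a fixed strip length can be written in C as follows. The function names and the constant STRIP are assumptions for illustration; the remainder loop handles the iterations left over when n is not a multiple of the strip length.

```c
enum { STRIP = 16 };    /* illustrative strip length */

/* Reference form: a single loop over all n elements. */
void add_scalar_plain(int *a, const int *b, int c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c;
}

/* Strip-mined form: fixed-count inner loop plus a remainder loop. */
void add_scalar_strip_mined(int *a, const int *b, int c, int n) {
    int up = (n / STRIP) * STRIP;           /* largest multiple of STRIP <= n */
    for (int i = 0; i < up; i = i + STRIP)
        for (int j = i; j < i + STRIP; j++) /* always STRIP iterations */
            a[j] = b[j] + c;
    for (int j = up; j < n; j++)            /* remaining n - up iterations */
        a[j] = b[j] + c;
}
```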



Loop Tiling

[0262] A loop tiling transformation may modify the iteration space of a loop nest by
introducing loop levels to divide the iteration space in tiles. It is a multi-dimensional
generalization of strip-mining. It is generally used to improve memory reuse, but can also
improve processor, register, translation lookaside buffer (TLB), or page locality. It is also
called loop blocking.

[0263] The size of the tiles of the iteration space may be chosen so that the data needed in
each tile fit in the cache memory, thus reducing the cache misses. In the case of coarse-grain
computers, the size of the tiles can also be chosen so that the number of parallel operations
of the loop body fits the number of processors of the computer.

[0264] An example of loop tiling is:

for (i=0; i<N; i++)                 for (ii=0; ii<N; ii = ii+16)
    for (j=0; j<N; j++)                 for (jj=0; jj<N; jj = jj+16)
        a[i][j] = b[j][i];                  for (i=ii; i<min(ii+16,N); i++)
                                                for (j=jj; j<min(jj+16,N); j++)
                                                    a[i][j] = b[j][i];
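A compact C sketch of a tiled transpose-style copy follows. The array dimension DIM, the tile size TILE, and the function names are assumptions chosen for the example; the inner bounds use min(ii + TILE, DIM) so that every element of a partial tile at the edge is still visited.

```c
enum { DIM = 40, TILE = 16 };   /* illustrative sizes */

static int min_int(int x, int y) { return x < y ? x : y; }

/* Untiled form: row-wise writes to a, column-wise reads from b. */
void transpose_plain(int a[DIM][DIM], int b[DIM][DIM]) {
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            a[i][j] = b[j][i];
}

/* Tiled form: the iteration space is walked in TILE x TILE blocks. */
void transpose_tiled(int a[DIM][DIM], int b[DIM][DIM]) {
    for (int ii = 0; ii < DIM; ii = ii + TILE)
        for (int jj = 0; jj < DIM; jj = jj + TILE)
            for (int i = ii; i < min_int(ii + TILE, DIM); i++)
                for (int j = jj; j < min_int(jj + TILE, DIM); j++)
                    a[i][j] = b[j][i];
}
```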



Loop Interchange 



[0265] A loop interchange transformation may be applied to a loop nest to move inside or
outside (depending on the searched effect) the loop level containing data dependencies. It can:

* enable vectorization by moving inside an independent loop and outside a dependent loop, or

* improve vectorization by moving inside the independent loop with the largest range, or

* deduce the stride, or

* increase the number of loop-invariant expressions in the inner-loop, or

* improve parallel performance by moving an independent loop outside of a loop nest to
increase the granularity of each iteration and reduce the number of barrier synchronizations.

[0271] An example of a loop interchange is:

for (i=0; i<N; i++)                 for (j=0; j<N; j++)
    for (j=0; j<N; j++)                 for (i=0; i<N; i++)
        a[i] = a[i] + b[i][j];              a[i] = a[i] + b[i][j];

Loop Coalescing / Collapsing

[0272] A loop coalescing / collapsing transformation may combine a loop nest into a single
loop. It can improve the scheduling of the loop, and also reduces the loop overhead. Collapsing
is a simpler version of coalescing in which the number of dimensions of arrays is reduced as
well. Collapsing may reduce the overhead of nested loops and multidimensional arrays.
Collapsing can be applied to loop nests that iterate over memory with a constant stride,
otherwise loop coalescing may be a better approach. It can be used to make vectorizing
profitable by increasing the iteration range of the innermost loop.

[0273] An example of loop coalescing is:

for (i=0; i<N; i++)                 for (k=0; k<N*M; k++) {
    for (j=0; j<M; j++)                 i = k / M;
        a[i][j] = a[i][j] + c;          j = k % M;
                                        a[i][j] = a[i][j] + c;
                                    }
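Loop coalescing with 0-based C indexing can be sketched as follows. The dimensions ROWS and COLS and the function names are illustrative assumptions; the original indices are recovered from the single loop counter k by integer division and remainder.

```c
enum { ROWS = 5, COLS = 7 };    /* illustrative dimensions */

/* Nested form: two loop levels. */
void add_nested(int a[ROWS][COLS], int c) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            a[i][j] = a[i][j] + c;
}

/* Coalesced form: one loop over ROWS * COLS iterations. */
void add_coalesced(int a[ROWS][COLS], int c) {
    for (int k = 0; k < ROWS * COLS; k++) {
        int i = k / COLS;       /* row index */
        int j = k % COLS;       /* column index */
        a[i][j] = a[i][j] + c;
    }
}
```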



Loop Fusion

[0274] A loop fusion transformation, also called loop jamming or loop merging, may merge two
successive loops. It may reduce loop overhead, increase instruction-level parallelism, improve
register, cache, TLB or page locality, and improve the load balance of parallel loops.
Alignment can be taken into account by introducing conditional instructions to take care of
dependencies.

[0275] An example of loop fusion is:

for (i=0; i<N; i++)                 for (i=0; i<N; i++) {
    a[i] = b[i] + c;                    a[i] = b[i] + c;
for (i=0; i<N; i++)                     d[i] = e[i] + c;
    d[i] = e[i] + c;                }



Loop Distribution

[0276] A loop distribution transformation, also called loop fission, may allow splitting a
loop into several pieces in case the loop body is too big, or because of dependencies. The
iteration space of the new loops may be the same as the iteration space of the original loop.
Loop spreading is a more sophisticated distribution.

[0277] An example of loop distribution is:

for (i=0; i<N; i++) {               for (i=0; i<N; i++)
    a[i] = b[i] + c;                    a[i] = b[i] + c;
    d[i] = e[i] + c;                for (i=0; i<N; i++)
}                                       d[i] = e[i] + c;
Loop Unrolling / Unroll-and-Jam 
[0278] A loop unrolling / unroll-and-jam transformation may replicate the original loop body in
order to get a larger one. A loop can be unrolled partially or completely. It may be used to
get more opportunity for parallelization by making the loop body bigger. It may also improve
register or cache usage and reduce loop overhead. Loop unrolling the outer loop followed by
merging the induced inner loops is referred to as unroll-and-jam.

[0279] An example of loop unrolling is:

for (i=0; i<N; i++)                 for (i=0; i<N-1; i = i+2) {
    a[i] = b[i] + c;                    a[i] = b[i] + c;
                                        a[i+1] = b[i+1] + c;
                                    }
                                    if ((N%2) == 1)
                                        a[N-1] = b[N-1] + c;
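Unrolling by a factor of two can be sketched in C as below. The function names are hypothetical; the unrolled loop only runs while two elements remain, and a single leftover iteration is handled afterwards when n is odd.

```c
/* Reference form. */
void add_scalar_rolled(int *a, const int *b, int c, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c;
}

/* Unrolled by two, with cleanup for an odd iteration count. */
void add_scalar_unrolled(int *a, const int *b, int c, int n) {
    int i;
    for (i = 0; i + 1 < n; i = i + 2) {
        a[i] = b[i] + c;
        a[i + 1] = b[i + 1] + c;
    }
    if (i < n)                  /* odd n: one iteration is left over */
        a[i] = b[i] + c;
}
```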
Loop Alignment 

[0280] A loop alignment optimization may transform the code to get aligned array accesses in
the loop body. Its effect may be to transform loop-carried dependencies into loop-independent
dependencies, which allows for extraction of more parallelism from a loop. It can use
different transformations, like loop peeling, or introduce conditional statements, to achieve
its goal. This transformation can be used in conjunction with loop fusion to enable this
optimization by aligning the array accesses in both loop nests. In the example below, all
accesses to array 'a' become aligned.

[0281] An example of loop alignment is:

for (i=2; i<=N; i++) {              for (i=1; i<=N; i++) {
    a[i] = b[i] + c[i];                 if (i>1) a[i] = b[i] + c[i];
    d[i] = a[i-1] * 2;                  if (i<N) d[i+1] = a[i] * 2;
    e[i] = a[i-1] + d[i+1];             if (i<N) e[i+1] = a[i] + d[i+2];
}                                   }
Loop Skewing

[0282] A loop skewing transformation may be used to enable parallelization of a loop nest. It
may be useful in combination with loop interchange. It may be performed by adding the outer
loop index multiplied by a skew factor, f, to the bounds of the inner loop variable, and then
subtracting the same quantity from every use of the inner loop variable inside the loop.

[0283] An example of loop skewing is:

for (i=1; i<=N; i++)                for (i=1; i<=N; i++)
    for (j=1; j<=N; j++)                for (j=i+1; j<=i+N; j++)
        a[i] = a[i+j] + c;                  a[i] = a[j] + c;

Loop Peeling 

[0284] A loop peeling transformation may remove a small number of beginning or ending
iterations of a loop to avoid dependences in the loop body. These removed iterations may be
executed separately. It can be used for matching the iteration control of adjacent loops to
enable loop fusion.

[0285] An example of loop peeling is:

for (i=0; i<=N; i++)                a[0][N] = a[0][N] + a[N][N];
    a[i][N] = a[0][N] + a[N][N];    for (i=1; i<=N-1; i++)
                                        a[i][N] = a[0][N] + a[N][N];
                                    a[N][N] = a[0][N] + a[N][N];



Loop Splitting

[0286] A loop splitting transformation may cut the iteration space in pieces by creating other
loop nests. It is also called Index Set Splitting, and is generally used because of
dependencies that prevent parallelization. The iteration space of the new loops may be a
subset of the original one. It can be seen as a generalization of loop peeling.

[0287] An example of loop splitting is:

for (i=0; i<=N; i++)                for (i=0; i<(N+1)/2; i++)
    a[i] = a[N-i+1] + c;                a[i] = a[N-i+1] + c;
                                    for (i=(N+1)/2; i<=N; i++)
                                        a[i] = a[N-i+1] + c;

Node Splitting

[0288] A node splitting transformation may split a statement in pieces. It may be used to
break dependence cycles in the dependence graph due to the too high granularity of the nodes,
thus enabling vectorization of the statements.

[0289] An example of node splitting is:

for (i=0; i<N; i++) {               for (i=0; i<N; i++) {
    b[i] = a[i] + c[i] * d[i];          t1[i] = c[i] * d[i];
    a[i+1] = b[i] * (d[i] - c[i]);      t2[i] = d[i] - c[i];
}                                       b[i] = a[i] + t1[i];
                                        a[i+1] = b[i] * t2[i];
                                    }



Scalar Expansion

[0290] A scalar expansion transformation may replace a scalar in a loop by an array to
eliminate dependencies in the loop body and enable parallelization of the loop nest. If the
scalar is used after the loop, compensation code must be added.

[0291] An example of scalar expansion is:

for (i=0; i<N; i++) {               for (i=0; i<N; i++) {
    c = b[i];                           tmp[i] = b[i];
    a[i] = a[i] + c;                    a[i] = a[i] + tmp[i];
}                                   }
                                    c = tmp[N-1];
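The role of the compensation code can be sketched in C as follows. The function names are hypothetical; both functions return the value that the scalar c holds after the loop, and the guard for n == 0 is an assumption added so that the sketch is safe for empty loops.

```c
/* Scalar form: every iteration writes the same scalar c. */
int accumulate_scalar(int *a, const int *b, int c, int n) {
    for (int i = 0; i < n; i++) {
        c = b[i];
        a[i] = a[i] + c;
    }
    return c;                   /* value of c after the loop */
}

/* Expanded form: each iteration owns tmp[i]; iterations are independent. */
int accumulate_expanded(int *a, const int *b, int *tmp, int c, int n) {
    for (int i = 0; i < n; i++) {
        tmp[i] = b[i];
        a[i] = a[i] + tmp[i];
    }
    if (n > 0)
        c = tmp[n - 1];         /* compensation code */
    return c;
}
```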



Array Contraction / Array Shrinking 
[0292] An array contraction / array shrinking transformation is the reverse transformation of
scalar expansion. It may be needed if scalar expansion generates too many memory requirements.

[0293] An example of array contraction is:

for (i=0; i<N; i++)                 for (i=0; i<N; i++)
    for (j=0; j<N; j++) {               for (j=0; j<N; j++) {
        t[i][j] = a[i][j] * 3;              t[j] = a[i][j] * 3;
        b[i][j] = t[i][j] + c[j];           b[i][j] = t[j] + c[j];
    }                                   }

Scalar Replacement 

[0294] A scalar replacement transformation may replace an invariant array reference in a loop
by a scalar. This array element may be loaded in a scalar before the inner loop and stored
again after the inner loop if it is modified. It can be used in conjunction with loop
interchange.

[0295] An example of scalar replacement is:

for (i=0; i<N; i++)                 for (i=0; i<N; i++) {
    for (j=0; j<N; j++)                 tmp = a[i];
        a[i] = a[i] + b[i][j];          for (j=0; j<N; j++)
                                            tmp = tmp + b[i][j];
                                        a[i] = tmp;
                                    }



Reduction Recognition 



[0296] A reduction recognition transformation may allow handling of reductions in loops. A
reduction may be an operation that computes a scalar value from arrays. It can be a dot
product, the sum or minimum of a vector for instance. A goal is then to perform as many
operations in parallel as possible. One way may be to accumulate a vector register of partial
results and then reduce it to a scalar with a sequential loop. Maximum parallelism may then be
achieved by reducing the vector register with a tree, i.e., pairs of elements are summed, then
pairs of these results are summed, etc.
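The partial-result accumulation followed by a tree reduction can be sketched in plain C as below. The lane count LANES (assumed to be a power of two) and the function names are illustrative assumptions; a small array stands in for the vector register.

```c
enum { LANES = 8 };     /* illustrative vector width, must be a power of two */

/* Sequential reference sum. */
int sum_sequential(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++)
        s = s + a[i];
    return s;
}

/* Partial sums per lane, then pairwise (tree) reduction of the lanes. */
int sum_tree(const int *a, int n) {
    int tmp[LANES] = {0};
    int full = (n / LANES) * LANES;
    int i;
    for (i = 0; i < full; i += LANES)
        for (int l = 0; l < LANES; l++)     /* lanes are independent */
            tmp[l] += a[i + l];
    for (; i < n; i++)                      /* leftover elements */
        tmp[0] += a[i];
    for (int width = LANES / 2; width >= 1; width /= 2)
        for (int l = 0; l < width; l++)     /* sum pairs, halving each step */
            tmp[l] += tmp[l + width];
    return tmp[0];
}
```

With LANES lanes the tree needs log2(LANES) combining steps instead of LANES - 1 sequential additions.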



[0297] An example of reduction recognition is:

for (i=0; i<N; i++)                 for (i=0; i<N; i=i+64)
    s = s + a[i];                       tmp[0:63] = tmp[0:63] + a[i:i+63];
                                    for (i=0; i<64; i++)
                                        s = s + tmp[i];

Loop Pushing / Loop Embedding

[0298] A loop pushing / loop embedding transformation may replace a call in a loop body by the
loop in the called function. It may be an interprocedural optimization. It may allow the
parallelization of the loop nest and eliminate the overhead caused by the procedure call. Loop
distribution can be used in conjunction with loop pushing.

[0299] An example of loop pushing is:

for (i=0; i<N; i++)                 f2(x);
    f(x,i);

void f(int* a, int j) {             void f2(int* a) {
    a[j] = a[j] + c;                    for (i=0; i<N; i++)
}                                           a[i] = a[i] + c;
                                    }



Procedure Inlining

[0300] A procedure inlining transformation replaces a call to a procedure by the code of the
procedure itself. It is an interprocedural optimization. It allows a loop nest to be
parallelized, removes overhead caused by the procedure call, and can improve locality.

[0301] An example of procedure inlining is:

for (i=0; i<N; i++)                 for (i=0; i<N; i++)
    f(a,i);                             a[i] = a[i] + c;

void f(int* x, int j) {
    x[j] = x[j] + c;
}

Statement Reordering 

[0302] A statement reordering transformation schedules instructions of the loop body to modify
the data dependence graph and enable vectorization.

[0303] An example of statement reordering is:

for (i=0; i<N; i++) {               for (i=0; i<N; i++) {
    a[i] = b[i] * 2;                    c[i] = a[i-1] - 4;
    c[i] = a[i-1] - 4;                  a[i] = b[i] * 2;
}                                   }

Software Pipelining 

[0304] A software pipelining transformation may parallelize a loop body by scheduling
instructions of different instances of the loop body. It may be a powerful optimization to
improve instruction-level parallelism. It can be used in conjunction with loop unrolling. In
the example below, the preload commands can be issued one after another, each taking only one
cycle. This time is just enough to request the memory areas. It is not enough to actually load
them. This takes many cycles, depending on the cache level that actually has the data.
Execution of a configuration behaves similarly. The configuration is issued in a single cycle,
waiting until all data are present. Then the configuration executes for many cycles. Software
pipelining overlaps the execution of a configuration with the preloads for the next
configuration. This way, the XPP array can be kept busy in parallel to the Load/Store unit.

[0305] An example of software pipelining is:

Issue Cycle     Command
                XPPPreloadConfig(CFG1);
                for (i=0; i<100; ++i) {
1:                  XPPPreload(2, a+10*i, 10);
                    XPPPreload(5, b+20*i, 20);
                    // delay
5:                  XPPExecute(CFG1);
6:              }

Issue Cycle     Command
Prologue        XPPPreloadConfig(CFG1);
                XPPPreload(2, a, 10);
                XPPPreload(5, b, 20);
                // delay
                for (i=1; i<100; ++i) {
Kernel 1:           XPPExecute(CFG1);
       2:           XPPPreload(2, a+10*i, 10);
       3:           XPPPreload(5, b+20*i, 20);
       4:       }
Epilog          // delay
                XPPExecute(CFG1);



Vector Statement Generation 
[0306] A vector statement generation transformation may replace instructions by vector
instructions that can perform an operation on several data in parallel. This occurs at the end
of the vectorization process, and is only of interest if the target processor is a vector
processor.

[0307] An example of vector statement generation is:

for (i=0; i<=N; i++)                a[0:N] = b[0:N];
    a[i] = b[i];

3.2.3 Data-Layout Optimizations 

[0308] Optimizations may modify the data layout in memory in order to extract more parallelism
or prevent memory problems like cache misses. Examples of such optimizations are scalar
privatization, array privatization, and array merging.

Scalar Privatization 

[0309] A scalar privatization optimization may be used in multi-processor systems to increase
the amount of parallelism and avoid unnecessary communications between the processing
elements. If a scalar is only used like a temporary variable in a loop body, then each
processing element can receive a copy of it and achieve its computations with this private
copy.

[0310] An example of scalar privatization is:

for (i=0; i<=N; i++) {
    c = b[i];
    a[i] = a[i] + c;
}

Array Privatization

[0311] An array privatization optimization may be the same as scalar privatization except that
it may work on arrays rather than on scalars.

Array Merging

[0312] An array merging optimization may transform the data layout of arrays by merging the
data of several arrays following the way they are accessed in a loop nest. This way, memory
cache misses can be avoided. The layout of the arrays can be different for each loop nest. The
example code for array merging presented below is an example of a cross-filter, where the
accesses to array 'a' are interleaved with accesses to array 'b'. Fig. 10 illustrates a data
layout of both arrays, where blocks of 'a' (the dark highlighted portions) are merged with
blocks of 'b' (the lighter highlighted portions). Unused memory space is represented by the
white portions. Thus, cache misses may be avoided as data blocks containing arrays 'a' and 'b'
are loaded into the cache when getting data from memory. More details can be found in Daniela
Genius & Sylvain Lelait, "A Case for Array Merging in Memory Hierarchies," Proceedings of the
9th International Workshop on Compilers for Parallel Computers, CPC'01 (June 2001).
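As a simplified illustration of merged layouts, the following C sketch interleaves two arrays element by element. This is an assumption made for brevity: the text describes merging blocks of 'a' with blocks of 'b', and the struct, function names, and dot-product workload here are hypothetical.

```c
#include <stddef.h>

/* Hypothetical merged layout: one element of 'a' next to one of 'b'. */
struct merged_elem {
    int a;
    int b;
};

/* Copy two separate arrays into the interleaved layout. */
void merge_arrays(struct merged_elem *m, const int *a, const int *b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        m[i].a = a[i];
        m[i].b = b[i];
    }
}

/* Separate layout: each iteration touches two distinct memory regions. */
long dot_separate(const int *a, const int *b, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += (long)a[i] * b[i];
    return s;
}

/* Merged layout: both operands of an iteration sit side by side. */
long dot_merged(const struct merged_elem *m, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += (long)m[i].a * m[i].b;
    return s;
}
```

Both traversals compute the same result; only the memory access pattern differs, which is where a cache benefit would be expected.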

3.2.4 Example of Application of the Optimizations

[0313] In accordance with that which is discussed above, it will be appreciated that a lot of
optimizations can be performed on loops before and also after generation of vector statements.
Finding a sequence of optimizations that would produce an optimal solution for all loop nests
of a program is still an area of research. Therefore, in an embodiment of the present
invention, a way to use these optimizations is provided that follows a reasonable heuristic to
produce vectorizable loop nests. To vectorize the code, the Allen-Kennedy algorithm, which uses
statement reordering and loop distribution before vector statements are generated, can be
used. It can be enhanced with loop interchange, scalar expansion, index set splitting, node
splitting, and loop peeling. All these transformations are based on the data dependence graph.
A statement can be vectorized if it is not part of a dependence cycle, hence optimizations may
be performed to break cycles or, if not completely possible, to create loop nests without
dependence cycles. The example presented below is intended as an illustration for the use of
the optimizations presented before.

[0314] The whole process may be divided into four major steps. First, the procedures may be
restructured by analyzing the procedure calls inside the loop bodies. Removal of the
procedures may then be tried. Then, some high-level dataflow optimizations may be applied to
the loop bodies to modify their control-flow and simplify their code. The third step may
include preparing the loop nests for vectorization by building perfect loop nests and ensuring
that inner loop levels are vectorizable. Then, optimizations can be performed that target the
architecture and optimize the data locality. It should also be noted that other optimizations
and code transformations can occur between these different steps that can also help to further
optimize the loop nests.
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|"03151 Tho Hence, the first step comprisos may apply procedure inlining and loop pushing to 
remove the procedure calls of the loop bodies.^T-h e Then, the second step consists o f may 
include loop-invariant code motion, loop unswitching, strength reduction and idiom recognition. 
5 The third step can be divided in several subsets of optimizations. Wo first apply loop Loop 
reversal, loop normalization and if-conversion to obtain may be initially applied to get 
normalized loop nests. This allows m ay allow building of the data dependence dependencv 
graph. If dependences Then, if dependencies prevent the loop nest to be vectorize d adequate^ 
transformations af emay be applied. If for For instance, dopondoncos if dependencies occur only 

10 on certain iterations, loop peeling or loop splitting can remove those dopondoncos. may be 

applied. Node splitting, loop skewing, scalar expansion or statement reordering can be applied 
in other cases .-Loq b Then, loop interchange moves m ay move inwards the loop levels without 
dependence cycles. The objective A goal is to obtain have perfectly nested loops with the loop 
levels carrying dependence cycles as much outwards as possible. Wo subsequently apply Then, 

1 5 loop fusion, reduction recognition, scalar replacement / array contraction! and loop distribution 
may be applied to further improve the following v ectorization. Finally vecto r Vector statement 
generation iscanbe performed fat last using the Allen-Kennedy algorithm; for instance). The 
last step consists o f can include optimizations fek- esuch as loop tiling, strip-mining, loop 
unrolling and software pipelining whic h that take into account the target processo r into account. 

20 , 

[0316] The number of optimizations in the third step may be large, but it may be that not all of
them are applied to each loop nest. Following the goal of the vectorization and the data
dependence graph, only some of them are applied. Heuristics may be used to guide the
application of the optimizations, which can be applied several times if needed. The following
code is an example of this:

void f(int** a, int** b, int i, int j) {
  a[i][j] = a[i][j-1] - b[i+1][j-1];
}

void g(int* a, int* c, int i) {
  a[i] = c[i] + 2;
}

for (i=0; i<N; i++) {
  for (j=1; j<9; j++) {
    if (k>0)
      f(a, b, i, j);
    else
      g(d, c, j);
  }
  d[i] = d[i+1] + 2;
}

for (i=0; i<N; i++)
  a[i][i] = b[i] + 3;

[0317] The first step will find that inlining the two procedure calls is possible. Then loop
unswitching can be applied to remove the conditional instruction of the loop body. The second
step may begin by applying loop normalization and analyses of the data dependence graph. A
cycle can be broken by applying loop interchange as it is only carried by the second level. The
two levels may be exchanged so that the inner level is vectorizable. Before that or also after,
loop distribution may be applied. Loop fusion can be applied when the loop on i is pulled out
of the conditional instruction by a traditional redundant code elimination optimization.
Finally, vector code can be generated for the resulting loops.

[0318] In more detail, after procedure inlining, the following may be obtained:

for (i=0; i<N; i++) {
  for (j=1; j<9; j++) {
    if (k>0)
      a[i][j] = a[i][j-1] - b[i+1][j-1];
    else
      d[j] = c[j] + 2;
  }
  d[i] = d[i+1] + 2;
}

for (i=0; i<N; i++)
  a[i][i] = b[i] + 3;



[0319] After loop unswitching, the following may be obtained:

if (k>0)
  for (i=0; i<N; i++) {
    for (j=1; j<9; j++)
      a[i][j] = a[i][j-1] - b[i+1][j-1];
    d[i] = d[i+1] + 2;
  }
else
  for (i=0; i<N; i++) {
    for (j=1; j<9; j++)
      d[j] = c[j] + 2;
    d[i] = d[i+1] + 2;
  }

for (i=0; i<N; i++)
  a[i][i] = b[i] + 3;

[0320] After loop normalization, the following may be obtained:

if (k>0)
  for (i=0; i<N; i++) {
    for (j=0; j<8; j++)
      a[i][j+1] = a[i][j] - b[i+1][j];
    d[i] = d[i+1] + 2;
  }
else
  for (i=0; i<N; i++) {
    for (j=0; j<8; j++)
      d[j] = c[j+1] + 2;
    d[i] = d[i+1] + 2;
  }

for (i=0; i<N; i++)
  a[i][i] = b[i] + 3;



[0321] After loop distribution and loop fusion, the following may be obtained:

if (k>0)
  for (i=0; i<N; i++)
    for (j=0; j<8; j++)
      a[i][j+1] = a[i][j] - b[i+1][j];
else
  for (i=0; i<N; i++)
    for (j=0; j<8; j++)
      d[j] = c[j+1] + 2;

for (i=0; i<N; i++) {
  d[i] = d[i+1] + 2;
  a[i][i] = b[i] + 3;
}

[0322] After loop interchange, the following may be obtained:

if (k>0)
  for (j=0; j<8; j++)
    for (i=0; i<N; i++)
      a[i][j+1] = a[i][j] - b[i+1][j];
else
  for (i=0; i<N; i++)
    for (j=0; j<8; j++)
      d[j] = c[j+1] + 2;

for (i=0; i<N; i++) {
  d[i] = d[i+1] + 2;
  a[i][i] = b[i] + 3;
}
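
The legality of the interchange for the k > 0 branch can be checked empirically. The following is a minimal sketch and not part of the original example: it assumes N = 16, two-dimensional arrays with nine columns and arbitrary fill values, and all function names are hypothetical. Because each a[i][j+1] depends only on a[i][j] of the same row, the dependence is carried by the j level alone, so both loop orders should produce identical results.

```c
#include <string.h>

#define N 16     /* assumed problem size, not taken from the original example */
#define COLS 9   /* j runs from 0 to 7 and writes column j+1 */

/* Original order: i outer, j inner (the inner loop carries the dependence). */
static void nest_i_outer(int a[N][COLS], int b[N + 1][COLS])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < 8; j++)
            a[i][j + 1] = a[i][j] - b[i + 1][j];
}

/* Interchanged order: j outer, i inner (the inner loop carries no dependence). */
static void nest_j_outer(int a[N][COLS], int b[N + 1][COLS])
{
    for (int j = 0; j < 8; j++)
        for (int i = 0; i < N; i++)
            a[i][j + 1] = a[i][j] - b[i + 1][j];
}

/* Runs both orders on identical inputs; returns 1 if the results match. */
int interchange_preserves_result(void)
{
    static int a1[N][COLS], a2[N][COLS], b[N + 1][COLS];

    for (int i = 0; i <= N; i++)
        for (int j = 0; j < COLS; j++)
            b[i][j] = 3 * i + j;              /* arbitrary test data */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < COLS; j++)
            a1[i][j] = a2[i][j] = i - 2 * j;  /* arbitrary test data */

    nest_i_outer(a1, b);
    nest_j_outer(a2, b);
    return memcmp(a1, a2, sizeof a1) == 0;
}
```

After the interchange, the inner i loop has no loop-carried dependence, which is the property the vector statement generation relies on.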



[0323] After vector code generation, the following may be obtained:

if (k>0)
  for (j=0; j<8; j++)
    a[0:N-1][j+1] = a[0:N-1][j] - b[0:N][j];
else
  for (i=0; i<N; i++)
    d[0:8] = c[1:9] + 2;

d[0:N-1] = d[1:N] + 2;
a[0:N-1][0:N-1] = b[0:N] + 3;



4 COMPILER SPECIFICATION FOR THE PACT XPP

4.1 Introduction

[0324] A cached RISC-XPP architecture may exploit its full potential on code that is
characterized by high data locality and high computational effort. A compiler for this
architecture has to consider these design constraints. The compiler's primary objective is to
concentrate computationally expensive calculations in innermost loops and to make up as much
data locality as possible for them.

[0325] The compiler may contain usual analysis and optimizations. As interprocedural
analyses, e.g., alias analysis, are especially useful, a global optimization driver may be
necessary to ensure the propagation of global information to all optimizations. The way the
PACT XPP may influence the compiler is discussed in the following sections.



4.2 Compiler Structure

[0326] Fig. 11 provides a global view of the compiling procedure and shows main steps the
compiler may follow to produce code for a system containing a RISC processor and a PACT XPP.
The next sections focus on the XPP compiler itself, but first the other steps are briefly
described.

4.2.1 Code Preparation

[0327] Code preparation may take the whole program as input and can be considered as a usual
compiler front-end. It may prepare the code by applying code analysis and optimizations to
enable the compiler to extract as many loop nests as possible to be executed by the PACT XPP.
Important optimizations are idiom recognition, copy propagation, dead code elimination, and
all usual analysis like dataflow and alias analysis.



[0328] Pointer and array accesses are represented identically in the intermediate code
representation which is built during the parsing of the source program. Hence, pointer accesses
are considered like array accesses in the data dependence analysis as well as in the
optimizations used to transform the loop bodies. Interprocedural alias analysis, for instance,
leads in the code shown below to the decision that the two pointers p and q never reference the
same memory area, and that the loop body may be successfully handled by the XPP rather than
by the host processor.

[0329] Example of pointer disambiguation (handling of pointer and array accesses):

int foo(int *p, int *q, int N)
{
  int i;
  for (i = 0; i < N; i++)
  {
    p[i] = q[i] * q[i+1];
  }
  return p[N-1];
}

main()
{
  int a[100], b[100];
  int N;
  foo(a, b, N);
}

4.2.2 Partitioning

[0330] Partitioning may decide which part of the program is executed by the host processor and
which part is executed by the PACT XPP.

[0331] A loop nest may be executed by the host in three cases:

[0332] * if the loop nest is not well-formed,

[0333] * if the number of operations to execute is not worth being executed on the PACT XPP, or

[0334] * if it is impossible to get a mapping of the loop nest on the PACT XPP.

[0335] A loop nest is said to be well-formed if the loop bounds are computable and the steps of
all loops are constant, the loop induction variables are known, and if there is only one entry
and one exit to the loop nest.

[0336] Another problem may arise with loop nests where the loop bounds are constant but
unknown at compile time. Loop tiling may allow for overcoming this problem, as will be
described below. Nevertheless, it could be that it is not worth executing the loop nest on the
PACT XPP if the loop bounds are too low. A conditional instruction testing if the loop bounds
are large enough can be introduced, and two versions of the loop nest may be produced. One
would be executed on the host processor, and the other on the PACT XPP when the loop bounds
are suitable. This would also ease applications of loop transformations needing minimal
iteration counts, as possible compensation code would be simpler due to the hypothesis on the
loop bounds.
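
The two-version scheme can be sketched in plain C. This is an illustration under stated assumptions only: the threshold MIN_PROFITABLE_TRIPS, the function names and the loop body are hypothetical, and the "XPP" version is an ordinary C stand-in so that the sketch runs on any host.

```c
#include <stddef.h>

#define MIN_PROFITABLE_TRIPS 64   /* hypothetical threshold, chosen for illustration */

/* Version kept on the host processor: a plain scalar loop. */
static void scale_host(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = 2 * src[i] + 1;
}

/* Stand-in for the version that would be mapped onto the array.
   Because this path is only entered when n >= MIN_PROFITABLE_TRIPS,
   transformations applied to it may assume that minimum trip count. */
static void scale_xpp(int *dst, const int *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = 2 * src[i] + 1;
}

/* Conditional instruction testing the loop bound at run time.
   Returns 1 if the array version was chosen, 0 if the host version ran. */
int scale(int *dst, const int *src, size_t n)
{
    if (n >= MIN_PROFITABLE_TRIPS) {
        scale_xpp(dst, src, n);
        return 1;
    }
    scale_host(dst, src, n);
    return 0;
}
```

Both paths compute the same result; only the place of execution differs, so the test on the loop bound affects performance, not semantics.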

4.2.3 RISC Code Generation and Scheduling

[0337] After the XPP compiler has produced NML code for the loops chosen by the partitioning
phase, the main compiling process may handle the code that will be executed by the host
processor where instructions to manage the configurations have been inserted. This is an aim of
the last two steps:

[0338] * RISC Code Generation and

[0339] * RISC Code Scheduling.

[0340] The first one may produce code for the host processor and the second one may optimize
it further by looking for a better scheduling using software pipelining, for instance.

4.3 XPP Compiler for Loops

[0341] Fig. 12 illustrates a detailed architecture and an internal processing of the XPP
Compiler. It is a complex cooperation between program transformations, included in the XPP
Loop optimizations, a temporal partitioning phase, NML code generation and the mapping of the
configuration on the PACT XPP.

[0342] First, loop optimizations targeted at the PACT XPP may be applied to try to produce
innermost loop bodies that can be executed on the array of processors. If this is the case, the
NML code generation phase may be called. If not, then temporal partitioning may be applied to
get several configurations for the same loop. After NML code generation and the mapping phase,
it can also happen that a configuration will not fit on the PACT XPP. In this case, the loop
optimizations may be applied again with respect to the reasons of failure of the NML code
generation or of the mapping. If this new application of loop optimizations does not change the
code, temporal partitioning may be applied. Furthermore, the number of attempts for the NML
code generation and the mapping may be kept track of. If too many attempts are made and a
solution is still not obtained, the process may be broken and the loop nest may be executed by
the host processor.

4.3.1 Temporal Partitioning

[0343] Temporal partitioning may split the code generated for the PACT XPP into several
configurations if the number of operations, i.e., the size of the configuration, to be executed
in a loop nest exceeds the number of operations executable in a single configuration. This
transformation is called loop dissevering. See, for example, Joao M.P. Cardoso & Markus
Weinhardt, "XPP-VC: A C Compiler with Temporal Partitioning for the PACT-XPP Architecture,"
Proceedings of the 12th International Conference on Field-Programmable Logic and Applications,
FPL'2002, 2438 LNCS, 864-874 (2002). These configurations may be then integrated in a loop of
configurations whose number of executions corresponds to the iteration range of the original
loop.

4.3.2 Generation of NML Code

[0344] Generation of NML code may take as input an intermediate form of the code produced by
the XPP Loop optimizations step, together with a dataflow graph built upon it. NML code can
then be produced by using tree- or DAG-pattern matching techniques [12, 13]. After this step,
specific NML optimizations are applied. For instance, partial redundancy elimination and
boolean simplification dedicated to optimizing the generated event networks are invoked.

4.3.3 Mapping Step

[0345] A mapping step may take care of mapping the NML modules on the PACT XPP by placing the
operations on the ALUs, FREGs, and BREGs, and routing the data through the buses.

4.4 XPP Loop Optimizations Driver

[0346] A goal of loop optimizations used for the PACT XPP is to extract as much parallelism as
possible from the loop nests in order to execute them on the PACT XPP by exploiting the
ALU-PAEs as effectively as possible and to avoid memory bottlenecks with the IRAMs. The
following sections explain how they may be organized and how to take into account the
architecture for applying the optimizations.

4.4.1 Organization of the System

[0347] Fig. 13 provides a detailed view of the XPP loop optimizations, including their
organization. The transformations may be divided in six groups. Other standard optimizations
and analysis may be applied in-between. Each group could be called several times. Loops over
several groups can also occur if needed. The number of iterations for each driver loop can be
of constant value or determined at compile time by the optimizations themselves (e.g., repeat
until a certain code quality is reached). In the first iteration of the loop, it can be checked
if loop nests are usable for the PACT XPP. It is mainly directed to check the loop bounds etc.
For instance, if the loop nest is well-formed and the data dependence graph does not prevent
optimization, but the loop bounds are unknown, then, in the first iteration, loop tiling may be
applied to get an innermost loop that is easier to handle and can be better optimized, and in
the second iteration, loop normalization, if-conversion, loop interchange and other
optimizations can be applied to effectively optimize the innermost loops for the PACT XPP.
Nevertheless, this has not been necessary until now with the examples presented below.

[0348] With reference to Fig. 13, Group I may ensure that no procedure calls occur in the loop
nest. Group II may prepare the loop bodies by removing loop-invariant instructions and
conditional instructions to ease the analysis. Group III may generate loop nests suitable for
the data dependence analysis. Group IV may contain optimizations to transform the loop nests to
get data dependence graphs that are suitable for vectorization. Group V may contain
optimizations that ensure that the innermost loops can be executed on the PACT XPP. Group VI
may contain optimizations that further extract parallelism from the loop bodies. Group VII may
contain optimizations more towards optimizing the usage of the hardware itself.

[0349] In each group, the application of the optimizations may depend on the result of the
analysis and the characteristics of the loop nest. For instance, it is clear that not all
transformations in Group IV are applied. It depends on the data dependence graph computed
before.

4.4.2 Loop Preparation

[0350] The optimizations of Groups I, II and III of the XPP compiler may generate loop bodies
without procedure calls, conditional instructions and induction variables other than loop
control variables. Thus, loop nests, where the innermost loops are suitable for execution on
the PACT XPP, may be obtained. The iteration ranges may be normalized to ease data dependence
analysis and the application of other code transformations.

4.4.3 Transformation of the Data Dependence Graph

[0351] The optimizations of Group IV may be performed to obtain innermost loops suitable for
vectorization with respect to the data dependence graph. Nevertheless, a difference with usual
vectorization is that a dependence cycle, which would normally prevent any vectorization of the
code, does not prevent the optimization of a loop nest for the PACT XPP. If a cycle is due to
an anti-dependence, then it could be that it will not prevent optimization of the code as
stated in Markus Weinhardt & Wayne Luk, "Pipeline Vectorization," IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 20(2):234-248 (February 2001).
Furthermore, dependence cycles will not prevent vectorization for the PACT XPP when they
consist only of a loop-carried true dependence on the same expression. If cycles with distance
k occur in the data dependence graph, then this can be handled by holding k values in
registers. This optimization is of the same class as cycle shrinking.
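
Holding k values in registers for a dependence cycle of distance k can be sketched as follows. This is a minimal illustration, assuming a distance of K = 3 and a simple recurrence a[i] = a[i-K] + x[i]; the function names and the recurrence are hypothetical and are not taken from the compiler itself.

```c
#define K 3   /* assumed dependence distance, chosen for illustration */

/* Reference version: reads a[i-K] back from memory in every iteration. */
void recur_memory(int *a, const int *x, int n)
{
    for (int i = K; i < n; i++)
        a[i] = a[i - K] + x[i];
}

/* Register version: the last K results are held in reg[], so the loop
   body never reads a[] and the cycle only runs through the registers. */
void recur_registers(int *a, const int *x, int n)
{
    int reg[K];

    if (n < K)
        return;                      /* nothing to compute, and no safe preload */
    for (int r = 0; r < K; r++)      /* preload the K initial values */
        reg[r] = a[r];
    for (int i = K; i < n; i++) {
        int v = reg[0] + x[i];       /* reg[0] holds a[i-K] */
        for (int r = 0; r < K - 1; r++)
            reg[r] = reg[r + 1];     /* values move one register forward */
        reg[K - 1] = v;
        a[i] = v;                    /* write-only access to memory */
    }
}
```

In hardware the K registers would form a short pipeline feeding the adder, so a new iteration can start every cycle even though each result depends on the one computed K iterations earlier.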

[0352] Nevertheless, limitations due to the dependence graph exist. Loop nests cannot be
handled if some dependence distances are not constant or are unknown. If only a few
dependencies prevent the optimization of the whole loop nest, this could be overcome by using
the traditional vectorization algorithm that sorts topologically the strongly connected
components of the data dependence graph (statement reordering), and then applying loop
distribution. This way, loop nests, some of which can be handled by the PACT XPP and some by
the host processor, can be obtained.

4.4.4 Influence of the Architectural Parameters

[0353] Some hardware specific parameters may influence the application of the loop
transformations. The number of operations and memory accesses that a loop body performs may be
estimated at each step. These parameters may influence loop unrolling, strip-mining, loop
tiling and also loop interchange (iteration range).

[0354] The table below lists the parameters that may influence the application of the
optimizations. For each of them, two data are given: a starting value computed from the loop
and a restriction value which is the value the parameter should reach or should not exceed
after the application of the optimizations. Vector length depicts the range of the innermost
loops, i.e., the number of elements (i.e., 32-bit data) of an array accessed in the loop body.
Reused data set size represents the amount of data that must fit in the cache. I/O IRAMs, ALU,
FREG, BREG stand for the number of IRAMs, ALUs, FREGs, and BREGs, respectively, of the PACT
XPP. The dataflow graph width represents the number of operations that can be executed in
parallel in the same pipeline stage. The dataflow graph height represents the length of the
pipeline. Configuration cycles amounts to the length of the pipeline and to the number of
cycles dedicated to the control. The application of each optimization may

[0355] * decrease a parameter's value (-),

[0356] * increase a parameter's value (+),

[0357] * not influence a parameter (id), or

[0358] * adapt a parameter's value to fit into the goal size (make fit).

[0359] Furthermore, some resources must be kept for control in the configuration. This means
that the optimizations should not make the needs exceed more than 70-80% of each resource.

Parameter                 Goal                        Starting Value
Vector length             IRAM size (128 words)       Loop count
Reused data set size      Approx. cache size          Algorithm analysis/loop sizes
I/O IRAMs                 XPP size (16)               Algorithm inputs + outputs
ALU                       XPP size (< 64)             ALU opcode estimate
BREG                      XPP size (< 80)             BREG opcode estimate
FREG                      XPP size (< 80)             FREG opcode estimate
Data flow graph width     High                        Algorithm data flow graph
Data flow graph height    Small                       Algorithm data flow graph
Configuration cycles      ≤ command line parameter    Algorithm analysis

[0360] Additional notations used in the following descriptions are as follows. n is the total
number of processing elements available, r is the width of the dataflow graph, in is the
maximum number of input values in a cycle, and out is the maximum number of output values
possible in a cycle. On the PACT XPP, n is the number of ALUs, FREGs and BREGs available for a
configuration, r is the number of ALUs, FREGs and BREGs that can be started in parallel in the
same pipeline stage, and in and out amount to the number of available IRAMs. As IRAMs have 1
input port and 1 output port, the number of IRAMs yields directly the number of input and
output data.

[0361] The number of operations of a loop body may be computed by adding all logic and
arithmetic operations occurring in the instructions. The number of input values is the number
of operands of the instructions regardless of address operations. The number of output values
is the number of output operands of the instructions regardless of address operations. To
determine the number of parallel operations, the input and output values as well as the
dataflow graph must be considered. The effects of each transformation on the architectural
parameters are now presented in detail.

Loop Interchange 

[0362] Loop interchange may be applied when the innermost loop has a too narrow iteration
range. In that case, loop interchange may allow for an innermost loop with a more profitable
iteration range. It can also be influenced by the layout of the data in memory. It can be
profitable to data locality to interchange two loops to get a more practical way to access
arrays in the cache and therefore prevent cache misses. It is of course also influenced by data
dependencies as explained above.

Parameter                 Effect
Vector length             +
Reused data set size      make fit
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     id
Data flow graph height    id
Configuration cycles      -

Loop Distribution 

[0363] Loop distribution may be applied if a loop body is too big to fit on the PACT XPP. A
main effect of loop distribution is to reduce the processing elements needed by the
configuration. Reducing the need for IRAMs can only be a side effect.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 make fit
ALU                       make fit
BREG                      make fit
FREG                      make fit
Data flow graph width     -
Data flow graph height    -
Configuration cycles      -

Loop Collapsing 

[0364] Loop collapsing can be used to make the loop body use more memory resources. As several
dimensions are merged, the iteration range is increased and the memory needed is increased as
well.

Parameter                 Effect
Vector length             +
Reused data set size      +
I/O IRAMs                 +
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height    +
Configuration cycles      +

Loop Tiling

[0365] Loop tiling, as multi-dimensional strip-mining, is influenced by all parameters. It may
be especially useful when the iteration space is by far too big to fit in the IRAM, or to
guarantee maximum execution time when the iteration space is unbounded. See the discussion
below under the heading "Limiting the Execution Time of a Configuration." It can then make the
loop body fit with respect to the resources of the PACT XPP, namely the IRAM and cache line
sizes. The size of the tiles for strip-mining and loop tiling can be computed as:

tile size = resources available for the loop body / resources necessary for the loop body

[0366] The resources available for the loop body are the whole resources of the PACT XPP for
this configuration. A tile size can be computed for the data and another one for the
processing elements. The final tile size is then the minimum between these two computations.
For instance, when the amount of data accessed is larger than the capacity of the cache, loop
tiling may be applied according to the following example.

[0367] Example of code for loop tiling for the PACT XPP:

for (i=0; i<=1048576; i++)
  <loop body>

for (i=0; i<=1048576; i+=CACHE_SIZE)
  for (j=0; j<CACHE_SIZE; j+=IRAM_SIZE)
    for (k=0; k<IRAM_SIZE; k++)
      <tiled loop body>



[0368]

Parameter                 Effect
Vector length             make fit
Reused data set size      make fit
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height    +
Configuration cycles      +
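
The tile-size rule and the tiled loop structure described for loop tiling can be sketched in C. This is only an illustration under assumptions: the resource figures passed to the functions are made up by the caller, all function names are hypothetical, the resource demand is assumed to be a positive per-iteration amount, and a simple summation stands in for the loop body.

```c
/* Tile size per the rule above: resources available / resources necessary.
   Both arguments are assumed positive; integer division rounds down so the
   tile never exceeds what is available. */
static long tile_size(long available, long necessary)
{
    return available / necessary;
}

/* One tile size for the data, one for the processing elements;
   the final tile size is the minimum of the two. */
long final_tile_size(long data_avail, long data_needed,
                     long pe_avail, long pe_needed)
{
    long t_data = tile_size(data_avail, data_needed);
    long t_pe = tile_size(pe_avail, pe_needed);
    return t_data < t_pe ? t_data : t_pe;
}

/* Tiled traversal: the outer loop steps from tile to tile, the inner loop
   processes one tile. The sum stands in for the tiled loop body.
   tile is assumed to be at least 1. */
long tiled_sum(const int *v, long n, long tile)
{
    long total = 0;
    for (long base = 0; base < n; base += tile) {
        long end = base + tile < n ? base + tile : n;  /* last tile may be partial */
        for (long k = base; k < end; k++)
            total += v[k];
    }
    return total;
}
```

For example, with 128 IRAM words available and 2 words needed per iteration, and 64 processing elements available with 4 needed per iteration, the data tile is 64 and the processing-element tile is 16, so the final tile size is 16.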

Strip-Mining 

[0369] Strip-mining may be used to make the amount of memory accesses of the innermost loop fit
with the IRAMs capacity. The processing elements do not usually represent a problem as the
PACT XPP has 64 ALU-PAEs which should be sufficient to execute any single loop body.
Nevertheless, the number of operations can be also taken into account the same way as the data.

Parameter                 Effect
Vector length             make fit
Reused data set size      id
I/O IRAMs                 -
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height    id
Configuration cycles      id

Loop Fusion 

[0370] Loop fusion may be applied when a loop body does not use enough resources. In this case,
several loop bodies can be merged to obtain a configuration using a larger part of the
available resources.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 +
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height    +
Configuration cycles      +

Scalar Replacement 

[0371] The amount of memory needed by the loop body should always fit in the IRAMs. Due to a
scalar replacement optimization, some input or output data represented by array references
that should be stored in IRAMs may be replaced by scalars that are either stored in FREGs or
kept on buses.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 -
ALU                       id
BREG                      id/+
FREG                      id/+
Data flow graph width     id/-
Data flow graph height    id/-
Configuration cycles      id
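
Scalar replacement can be illustrated on a small accumulation loop. The following is a minimal sketch, assuming a 4x4 matrix-vector product as the loop nest; the function names and sizes are hypothetical. In the first version the array element c[i] is read and written in every inner iteration; in the second it is replaced by the scalar t, which on the PACT XPP would correspond to a value held in a register or kept on a bus instead of an IRAM access.

```c
#define M 4   /* assumed problem size, chosen for illustration */

/* Before: c[i] is an array reference in the inner loop (one read and one
   write to memory per inner iteration). */
void matvec_array(int c[M], int a[M][M], const int b[M])
{
    for (int i = 0; i < M; i++)
        for (int j = 0; j < M; j++)
            c[i] = c[i] + a[i][j] * b[j];
}

/* After scalar replacement: c[i] is loaded once into t, accumulated in the
   scalar, and stored once after the inner loop. */
void matvec_scalar(int c[M], int a[M][M], const int b[M])
{
    for (int i = 0; i < M; i++) {
        int t = c[i];
        for (int j = 0; j < M; j++)
            t += a[i][j] * b[j];
        c[i] = t;
    }
}
```

The transformation is valid here because c[i] cannot alias a or b inside the inner loop; a compiler would need alias analysis to establish that before applying it.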

Loop Unrolling / Loop Collapsing / Loop Fusion

[0372] Loop unrolling, loop collapsing, loop fusion and loop distribution may be influenced by
the number of operations of the body of the loop nest and the number of data inputs and outputs
of these operations, as they modify the size of the loop body. The number of operations should
always be smaller than n, and the number of input and output data should always be smaller than
in and out. Note that although the number of configuration cycles increases, the throughput
increases as well, resulting in better performance.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 +
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height    +
Configuration cycles      +

Loop Distribution 

[0373] Like the optimizations above, loop distribution is influenced by the number of
operations of the body of the loop nest and the number of data inputs and outputs of these
operations. The number of operations should always be smaller than n, and the number of input
and output data should always be smaller than in and out. The following table describes the
effect for each of the loops resulting from the loop distribution.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 -
ALU                       -
BREG                      -
FREG                      -
Data flow graph width     id
Data flow graph height    -
Configuration cycles      -

Unroll-and-Jam 

[0374] Unroll-and-Jam may include unrolling an outer loop and then merging the inner loops. It
must compute the unrolling degree u with respect to the number of input memory accesses m and
output memory accesses p in the inner loop. The following inequality must hold:
u*m ≤ in ∧ u*p ≤ out. Moreover, the number of operations of the new inner loop must also fit
on the PACT XPP. The unrolling degree u is computed using the following formula:
u = min(u_PAE, u_RAM), where u_PAE and u_RAM are computed by the same formula:
u = ⌊resources available / Σ resources needed⌋. Once more, although the number of
configuration cycles increases, the throughput increases as well, resulting in better
performance.

Parameter                 Effect
Vector length             id
Reused data set size      +
I/O IRAMs                 +
ALU                       +
BREG                      +
FREG                      +
Data flow graph width     id
Data flow graph height    +
Configuration cycles      +
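
The memory-access constraint on the unrolling degree can be turned into a small helper. This is a hedged sketch, not the compiler's actual routine: the function name is hypothetical, only the inequality u*m ≤ in and u*p ≤ out is modeled, and the separate check that the operations of the new inner loop fit on the array is left out.

```c
#include <limits.h>

/* Largest unrolling degree u such that u*m <= in and u*p <= out.
   m, p    : input and output memory accesses of the inner loop
   in, out : input and output values available per cycle
   A side without accesses imposes no limit (INT_MAX). A result of 0
   means that even the original inner loop does not satisfy the inequality. */
int unroll_degree(int m, int p, int in, int out)
{
    int u_in = (m > 0) ? in / m : INT_MAX;    /* integer division rounds down */
    int u_out = (p > 0) ? out / p : INT_MAX;
    return u_in < u_out ? u_in : u_out;
}
```

With 16 IRAMs on each side, an inner loop with 3 input accesses and 1 output access gives u = min(5, 16) = 5, while 2 input and 4 output accesses give u = min(8, 4) = 4.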

4.4.5 Target Specific Optimizations 
[0375] At this step other optimizations, specific to the PACT XPP, can be made. These
optimizations deal mostly with memory problems and dataflow considerations. This is the case
of shift register synthesis, input data duplication (similar to scalar or array
privatization), or loop pipelining.

Shift Register Synthesis

[0376] A shift register synthesis optimization deals with array accesses that occur during
the execution of a loop body. When several values of an array are alive for different
iterations, it can be convenient to store them in registers, rather than accessing memory each
time they are needed. As the same value must be stored in different registers depending on the
number of iterations it is alive, a value shares several registers and flows from one register
to another at each iteration. It is similar to a vector register allocated to an array access
with the same value for each element. This optimization is performed directly on the dataflow
graph by inserting nodes representing registers when a value must be stored in a register. In
the PACT XPP, it amounts to storing it in a data register. A detailed explanation can be found
in Markus Weinhardt & Wayne Luk, "Memory Access Optimization for Reconfigurable Systems,"
IEEE Proceedings Computers and Digital Techniques, 48(3) (May 2001).
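
The idea can be illustrated with a three-element sliding window. The following is only a sketch: the function names are hypothetical, and plain C variables stand in for the data registers of the array. The direct version reads three array elements per iteration, while the shift register version reads one and lets the two older values flow through registers.

```c
#include <assert.h>

/* Reference: every iteration reads three array elements. */
void window_sum_direct(const int *in, int *out, int n)
{
    assert(n >= 3);
    for (int h = 0; h + 2 < n; h++)
        out[h] = in[h] + in[h + 1] + in[h + 2];
}

/* Shift register version: one array read per iteration; the two older
   values flow through the registers r0 and r1. */
void window_sum_shift(const int *in, int *out, int n)
{
    assert(n >= 3);
    int r0 = 0, r1 = 0, r2 = 0;
    for (int h = 0; h < n; h++) {
        r0 = r1;          /* value read two iterations ago */
        r1 = r2;          /* value read one iteration ago  */
        r2 = in[h];       /* the only memory access        */
        if (h >= 2)       /* guard: all registers are filled */
            out[h - 2] = r0 + r1 + r2;
    }
}
```

The guard on h corresponds to the guards on output values mentioned for this optimization: no result is produced until every register holds a valid value.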

[0377] Shift register synthesis may be mainly suitable for small to medium amounts of
iterations where values are alive. Since the pipeline length increases with each iteration for
which the value has to be buffered, the following method is better suited for medium to large
distances between accesses in one input array.

[0378] Nevertheless, this method may work very well for image processing algorithms which
mostly alter a pixel by analyzing itself and its surrounding neighbors. Some resources are
needed to produce guards on input or output values to ensure the semantics of the produced
code, as all registers must be filled to allow the code to produce correct values.

Parameter                 Effect
Vector length             +
Reused data set size      id
I/O IRAMs                 id
ALU                       +
BREG                      id/+
FREG                      +
Data flow graph width     -
Data flow graph height    +
Configuration cycles      +

Input Data Duplication

[0379] An input data duplication optimization is orthogonal to shift register synthesis. If
different elements of the same array are needed concurrently, instead of storing the values in
registers, the same values may be copied into different IRAMs. The advantages over shift
register synthesis are the shorter pipeline length, and therefore the increased parallelism,
and the unrestricted applicability. On the other hand, the cache-IRAM bottleneck can affect the
performance of this solution, depending on the amounts of data to be moved. Nevertheless, it is
assumed that cache-IRAM transfers are negligible compared to transfers in the rest of the
memory hierarchy.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 +
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     +
Data flow graph height    -
Configuration cycles      id

FIFO Pipelining

[0380] This optimization is used to store an array in the memory of the PACT XPP when the
size of the array is smaller than the total amount of memory of the PACT XPP, but larger than
the size of an IRAM. It can be used for input or output data. Several IRAMs in FIFO mode are
linked to each other, and the input/output port of the last one is used by the computing
network. A condition for using this method is that the access pattern of the elements of the
array must allow using the FIFO mode. It avoids having to apply loop tiling/strip-mining to
make an array fit on the PACT XPP.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 +
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     id
Data flow graph height    -
Configuration cycles      +

Loop Pipelining

[0381] A loop pipelining optimization may include synchronizing operations by inserting
delays in the dataflow graph. These delays may be registers. For the PACT XPP, it amounts to
storing values in data registers to delay the operation using them. This is the same as
pipeline balancing performed by xmap.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      +
FREG                      +
Data flow graph width     +
Data flow graph height    -/id
Configuration cycles      -

Tree Balancing

[0382] A tree balancing optimization may include balancing the tree representing the loop
body. It may reduce the depth of the pipeline, thus reducing the execution time of an
iteration, and may increase parallelism.

Parameter                 Effect
Vector length             id
Reused data set size      id
I/O IRAMs                 id
ALU                       id
BREG                      id
FREG                      id
Data flow graph width     id
Data flow graph height    -
Configuration cycles      -
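
The effect on pipeline depth can be illustrated with a sum of n operands. The sketch below is an illustration only, with hypothetical function names: a linear chain of additions has depth n-1, while a balanced tree over the same operands has depth ceil(log2 n), at an unchanged number of additions.

```c
#include <assert.h>

/* Linear chain (((a0+a1)+a2)+a3)+...: depth is n-1 additions. */
int sum_linear(const int *a, int n, int *depth)
{
    assert(n >= 1);
    int s = a[0];
    for (int i = 1; i < n; i++)
        s += a[i];
    *depth = n - 1;
    return s;
}

/* Balanced tree: add the sums of two halves; depth is ceil(log2 n). */
int sum_balanced(const int *a, int n, int *depth)
{
    assert(n >= 1);
    if (n == 1) {
        *depth = 0;
        return a[0];
    }
    int dl, dr;
    int half = n / 2;
    int l = sum_balanced(a, half, &dl);
    int r = sum_balanced(a + half, n - half, &dr);
    *depth = 1 + (dl > dr ? dl : dr);
    return l + r;
}
```

For eight operands the linear chain has depth 7 and the balanced tree depth 3, which is the reduction of the dataflow graph height listed in the table above.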

4.4.6 Memory Optimizations

Optimization of Memory Accesses 

[0383] Memory accesses are a particular concern for the PACT XPP. They need to be reduced
in order to get enough parallelism to exploit. The loop bodies are freed of unnecessary memory
accesses when shift register synthesis and scalar replacement are applied. Scalar replacement
has the same effect as redundant load/store elimination. Array accesses are taken out of the
loop body and handled by the host processor. It should be noted that redundant load/store
elimination takes care not only of array accesses but also of accesses to global variables and
records. On the other hand, shift register synthesis removes some accesses completely from the
code.

Access Patterns and Loading of the Data into the IRAMs 
[0384] A major issue is also how to load data into the IRAMs efficiently in terms of resources
consumed and in terms of execution time. Non-linear access patterns consume a lot of resources
to compute the addresses; moreover, their loading into the IRAMs can then be delayed by cache
misses and these costly computations. Furthermore, it is profitable for the execution time when
the accesses are linear between the IRAMs and the ALU-PAEs.

[0385] As already stated in section 2.2.5, methods exist to prevent these problems. They can
be applied at different levels:

* [0386] on the data layout,

* [0387] the source code, or

* [0388] on the data transfer.

[0389] By modifying the data layout, the access patterns are simplified, thus saving resources
and computation time. This is achieved by array merging, for instance.

[0390] The source code itself can be modified to simplify the access patterns. This is the case
for matrix multiplication, presented in the case studies, where a matrix is transposed to obtain
an access line-by-line and not row-by-row, or in the example presented at the end of the
section. On the other hand, loop tiling allows filling the IRAMs by modifying the iteration
range of the innermost loop.

[0391] Furthermore, the access patterns can be modified by reordering the data. This can happen
in two ways, as already described in section 2.2.5:

* [0392] either by loading the data in the IRAMs in a specific order,

* [0393] or by reordering the data dynamically.

[0394] The first data reordering strategy supposes a constant stride between two accesses. If
this is not the case, then the second approach is chosen. More resources are needed, as the
flow of data is reordered by computations done on the PACT XPP to feed the ALU-PAEs, but the
data are accessed linearly inside the IRAMs.

[0395] Finally, if none of these methods is applicable, and the access patterns are too costly
to be synthesized on the XPP array, the index expressions are computed in advance and loaded
into an IRAM that is used as an index for accessing the array values stored in another IRAM.
For instance, with the following loop the values {0,0,0,1,1,1,...,7,7,8} are loaded in an
IRAM, and will feed the address input of the IRAM containing array b.

    for (i=0; i<=24; i++)
        a[i] = b[i/3];

[0396] In this example, where only one expression causes a problem, another solution is to
apply loop tiling to prune it. The resulting loop is shown below. The expression i/3 evaluates
to 0, as i is always smaller than 3, so the assignment reduces to a[i+3*j] = b[j]. This is
found by the value range analysis. The access pattern can then be synthesized on the XPP array
to access the array values in the IRAMs.

    for (j=0; j<=7; j++)
        for (i=0; i<3; i++) {
            a[i+3*j] = b[i/3+j];
        }
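
The two alternatives for the loop a[i] = b[i/3] can be compared in a small C sketch. The function names are hypothetical, and an ordinary array stands in for the IRAM that holds the precomputed index values. Note that the tiled loop nest covers a[0..23] only; handling a[24] separately is an assumption of this sketch and is not taken from the text.

```c
#include <assert.h>

#define N_A 25   /* a[0..24] */
#define N_B 9    /* b[0..8]  */

/* Direct form: the address of b needs a division in every iteration. */
void fill_direct(int *a, const int *b)
{
    for (int i = 0; i <= 24; i++)
        a[i] = b[i / 3];
}

/* Index table form: idx[] plays the role of the IRAM holding the
   precomputed index expressions {0,0,0,1,1,1,...}. */
void fill_indexed(int *a, const int *b)
{
    int idx[N_A];
    for (int i = 0; i < N_A; i++)
        idx[i] = i / 3;             /* computed in advance */
    for (int i = 0; i < N_A; i++) {
        assert(idx[i] < N_B);
        a[i] = b[idx[i]];           /* the array side only sees a lookup */
    }
}

/* Tiled form: inside a tile i/3 is 0, so b is addressed by j alone. */
void fill_tiled(int *a, const int *b)
{
    for (int j = 0; j <= 7; j++)
        for (int i = 0; i < 3; i++)
            a[i + 3 * j] = b[j];
    a[24] = b[8];                   /* remainder element (assumption) */
}
```

All three functions produce the same contents of a; they differ only in where the address computation takes place.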



4.4.7 Limiting the Execution Time of a Configuration

[0397] The execution time of a configuration must be controlled. This is ensured in the
compiler by strip-mining and loop tiling, which take care that no more input data than the
IRAMs' capacity enters the PACT XPP in a cycle. This way the iteration range of the innermost
loop that is executed on the PACT XPP is limited, and therefore its execution time. Moreover,
partitioning ensures that loops whose execution count can be computed at run time are going to
be executed on the PACT XPP. This condition is trivial for for-loops, but for while-loops,
where the execution count cannot be determined statically, a transformation exemplified by the
code below can be applied. As a result, the inner for-loop can be handled by the PACT XPP.

[0398] Transformation of while-loops:

    while (ok) {              while (ok)
        <loop body>               for (i=0; i<100 && ok; i++) {
    }                                 <loop body>
                                  }
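
The transformation can be tried out with a concrete loop body. In the sketch below, the body, the function names, and the counter of inner loop launches are hypothetical and serve only to show that both forms execute the body the same number of times, while the transformed form bounds each inner loop to at most 100 iterations.

```c
#include <assert.h>

/* Loop body used for the demonstration: counts down and clears *ok
   when the counter reaches zero. */
static void body(int *remaining, int *ok)
{
    assert(*remaining > 0);
    if (--*remaining == 0)
        *ok = 0;
}

/* Original form: the iteration count is unknown before the loop runs. */
int run_while(int work)
{
    int ok = work > 0, iterations = 0;
    while (ok) {
        body(&work, &ok);
        iterations++;
    }
    return iterations;
}

/* Transformed form: the inner for-loop runs at most 100 iterations,
   so one execution of the inner loop is bounded. */
int run_bounded(int work, int *inner_launches)
{
    int ok = work > 0, iterations = 0;
    *inner_launches = 0;
    while (ok) {
        (*inner_launches)++;
        for (int i = 0; i < 100 && ok; i++) {
            body(&work, &ok);
            iterations++;
        }
    }
    return iterations;
}
```

For 250 units of work both forms run the body 250 times; the transformed form does so in three launches of the inner loop (100, 100, and 50 iterations).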

5 CASE STUDIES

5.1 Introduction

[0399] The following chapter contains six case studies from fields where a RISC-XPP
combination fits best. As typical DSP examples, a finite impulse response (FIR) filter and a
Viterbi decoder are investigated. Image processing algorithms are represented by an edge
detector function, the inverse discrete cosine transformation from an MPEG codec, and a
wavelet transformation. Furthermore, a matrix multiplication and the quantization functions of
the MPEG codec are investigated.

[0400] All algorithms are transformed with various optimizations presented in the preceding
chapters. The result of the transformations is presented in C code, which is sometimes
shortened for better understanding. In a last step the code is split into C code which runs on
the RISC host and C code which runs on the XPP array. Furthermore, the XPP configuration is
presented as a dataflow graph, which should generally give a better understanding, since some
features of the XPP array cannot be presented in C adequately.

5.2 Conventions

5.2.1 Configuration and IRAM Names






[0401] Configurations are named by a prefix XppCfg_ and a name. They are defined as C
functions without parameters and without a return value.

[0402] The communication with the rest of the system is done over the IRAMs exclusively.
They are identified by a number between 0 and 15. In the C representation of configurations
they are declared differently depending on how they are used:

* [0403] As a pointer of type (unsigned) char*, short*, or int*, respectively. When this
representation is used, the IRAM is used in FIFO mode. Although this notation is not totally
correct, it describes the access mode best. IRAMs in this mode are read and written
sequentially starting with address 0. No address generators are needed. The access is
illustrated by using the post-increment notation *iram<N>++. When the declaration is of a
smaller data type than integer, this silently implies that converters to 32 bits are produced
by the compiler.

* [0404] As arrays of type (unsigned) char[512], short[256], or int[128], respectively. The
access notation in C is then iram<N>[offset expression]. In contrast to FIFO access, dedicated
address generators must be synthesized. As mentioned above, the usage of data types smaller
than integer implies automatically generated data type converters.

[0405] All code parts outside an XppCfg_-prefixed function are meant to run on the RISC host.
The RISC code contains, besides normal C statements, calls to the compiler known functions
which are presented in the hardware chapter.

5.2.2 Endianness

[0406] We assume big endian data layout. This means that the string representation of the word
"PACT XPP" loaded to an IRAM causes the following IRAM content.

Address   Content
0x00      0x50414354   ('P' << 24 | 'A' << 16 | 'C' << 8 | 'T')
0x01      0x20585050   (' ' << 24 | 'X' << 16 | 'P' << 8 | 'P')

[0407] Similarly, loading an array of four 16-bit (short) values with the values 0x1234,
0x5678, 0x9abc and 0xdef0, respectively, causes the following content.

Address   Content
0x00      0x12345678
0x01      0x9abcdef0

[0408] There is no special reason for this choice; little endian order would be possible, too.
Of course, the predefined modules in the next section must then be adapted to the changed data
layout.
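
The big endian layout can be reproduced with a few lines of C. The helper names below are hypothetical; the sketch only illustrates how 8-bit and 16-bit values are packed into 32-bit IRAM words under the convention assumed here.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 8-bit values into one 32-bit IRAM word, with the first
   value in the most significant byte (big endian). */
uint32_t pack_bytes_be(const unsigned char c[4])
{
    return ((uint32_t)c[0] << 24) | ((uint32_t)c[1] << 16) |
           ((uint32_t)c[2] << 8)  |  (uint32_t)c[3];
}

/* Pack two 16-bit values into one 32-bit IRAM word, first value high. */
uint32_t pack_shorts_be(uint16_t first, uint16_t second)
{
    return ((uint32_t)first << 16) | (uint32_t)second;
}

/* Fill IRAM words from a byte string whose length is a multiple of 4. */
void load_string_be(const char *s, int len, uint32_t *iram)
{
    assert(len % 4 == 0);
    for (int w = 0; w < len / 4; w++)
        iram[w] = pack_bytes_be((const unsigned char *)s + 4 * w);
}
```

Loading the eight characters of "PACT XPP" with load_string_be gives the two words 0x50414354 and 0x20585050, matching the table for the string example.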



5.2.3 Predefined Modules

[0409] For better readability of the examples some predefined modules are used. In the
following subsections they are briefly described and their dataflow graphs are given.

Up Counters

[0410] The counters are used on one hand to drive the IRAM reads and writes and, on the other
hand, to generate event sequences for the conversion modules presented next. The different
implementations are described in [12] in detail.

Conversion Modules

[0411] Predefined conversion modules are used throughout the case studies. The compiler
handles them as compiler known functions. The compiler either generates conversion modules
which produce a sequential stream of converted values, or it generates modules which simply
split packets into parallel streams which then can be processed concurrently. FIG. 14 shows
the implementations of the converters which convert to one stream. They output one 8/16-bit
value per cycle. The input connectors expect data packets with packed values of the shorter
data type. Furthermore, the selector inputs need special event sequences for correct operation.

[0412] The second type of converters, which can only be used if dependences allow it, simply
split a data packet into 2 or 4 streams with Boolean operations, and do a sign extension if
necessary. Since the implementations are straightforward, the dataflow graphs are omitted.

5.3 Performance Evaluation Procedure

5.3.1 Target Hardware Platform

[0413] The case studies are based on the basic design presented in chapter 2.5 above. The
following parameters were used for the evaluation design:

Unit                    Frequency   Remarks
RISC core               400 MHz
XPP Cache Controller    400 MHz     1 preload FIFO stage
XPP PAE Array           200 MHz     8 x 8 ALU PAEs, 16 IRAM ports, 4 I/O ports

Storage   Frequency   Size
ICache    400 MHz     64 KB, fully associative, cache line 32 bytes
DCache    400 MHz     128 KB, fully associative, cache line 32 bytes,
                      write-back / write allocation
IRAMs     400 MHz     32 KB (16 ports x 4 shadows x 128 ints x 32 bits)

Bus             Bus Frequency   Bus width   Max Throughput
ICache-PAE      400 MHz         32 bit      1600 MB/s
DCache-IRAMs    400 MHz         128 bit     6400 MB/s
SDRAM           100 MHz         32 bit      400 MB/s
                                            Read Burst: 7-1-1-1-1-1-1-1
                                            Write Burst: 1-1-1-1-1-1-1-1

[0414] As a simplification, we do not consider alignment, assuming a cache miss every
thirty-two bytes when reading succeeding memory cells. We may do this because we potentially
omit only a single cache miss, which occurs if the array spans one more cache line due to
misalignment.

[0415] Execution times, in 400 MHz cycles:

Resource            Transfer                   t(data size [bits]) [400 MHz cycles]
ICache Hit          ICache -> PAE Array        ceil(data size / 32)
DCache Hit          DCache -> IRAM or          ceil(data size / 128)
                    IRAM -> DCache
Cache Read Miss     RAM -> Cache               roundUp(data size, 256) / (8 * 32 / ((7 + 7 * 1) * 4))
                                               = ceil(data size * 56/256)
Cache Write-Back    Cache -> RAM               roundUp(data size, 256) / (8 * 32 / ((8 * 1) * 4))
                                               = ceil(data size * 32/256)
Cache Write Miss    IRAM -> RAM                Cache Read Miss + Write Transfer (IRAM -> Cache)
                                               = ceil(data size * 56/256) + ceil(data size / 128)
Execution           PAE Array                  Configuration execution cycles * 2
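
The entries of this table translate directly into integer arithmetic. The following sketch uses hypothetical function names and takes the data size in bits, as the table does; it implements the simplified ceil(...) forms of the formulas rather than the roundUp forms.

```c
#include <assert.h>

/* ceil(a / b) for non-negative a and positive b */
static long ceil_div(long a, long b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* ICache -> PAE array over the 32-bit bus */
long icache_hit_cycles(long bits) { return ceil_div(bits, 32); }

/* DCache <-> IRAM over the 128-bit bus */
long dcache_hit_cycles(long bits) { return ceil_div(bits, 128); }

/* Read burst 7-1-1-1-1-1-1-1 = 14 bus cycles per 256 bits,
   times 4 for scaling to 400 MHz = 56 cycles per 256 bits */
long read_miss_cycles(long bits) { return ceil_div(bits * 56, 256); }

/* Write burst 1-1-1-1-1-1-1-1 = 8 bus cycles per 256 bits, times 4 */
long write_back_cycles(long bits) { return ceil_div(bits * 32, 256); }

/* Write miss: read miss (write allocation) plus IRAM -> cache transfer */
long write_miss_cycles(long bits)
{
    return read_miss_cycles(bits) + dcache_hit_cycles(bits);
}

/* The PAE array runs at half the cache frequency */
long execution_cycles(long xpp_cycles) { return xpp_cycles * 2; }
```

For a 256-byte array (2048 bits) the sketch gives 448 cycles for a cache read miss and 16 cycles for the cache-IRAM transfer.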


[0416] Whenever there are no pipeline stalls, the different units and busses can work in
parallel. Thus the total execution time is defined by the following formula, where RAM transfer
cycles summarizes the cycles of the cache read misses and the cache write-back cycles:

    max( Sum(Execution cycles),
         Sum(RAM transfer cycles),
         max( Sum(ICache transfer cycles),
              Sum(DCache transfer cycles) ) )   [cycles @ 400 MHz]

[0417] If there are pipeline stalls, the outer maximum is replaced by a sum, reflecting the
fact that the units have to wait for each other to finish.

[0418] Only the amount of data that actually has to be transferred is considered. Data that is
already in a cache or in the IRAMs is not accounted for.
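
The total execution time formula, including the variant for pipeline stalls, can be written as a short C function. The function name and the stall flag are hypothetical; the arguments are the already summed cycle counts per category.

```c
#include <assert.h>

static long max2(long a, long b) { return a > b ? a : b; }

/* stalls == 0: the units work in parallel, the outer maximum applies.
   stalls != 0: the units wait for each other, so the outer maximum
   is replaced by a sum; the inner maximum is kept. */
long total_cycles(long exec, long ram, long icache, long dcache, int stalls)
{
    assert(exec >= 0 && ram >= 0 && icache >= 0 && dcache >= 0);
    long cache = max2(icache, dcache);
    if (stalls)
        return exec + ram + cache;
    return max2(exec, max2(ram, cache));
}
```

As an illustration with sample figures, 366 execution cycles, 10516 RAM transfer cycles, 1377 ICache transfer cycles, and 36 DCache transfer cycles give a total of 10516 cycles without stalls, since the RAM transfers dominate.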


[0419] For the startup case, the caches are assumed to be empty. Only the read data is
considered, as the write-backs of the first iteration will take place in the next iteration.
Due to the dependences, the above formula changes to a sum over all configurations of the
following per-configuration term:

    ICache read miss +
    max( ICache transfer cycles,
         Data cache read miss_1 +
         Sum_{i=2..n}( max( Data cache read miss_i, DCache transfer_{i-1} ) ) +
         DCache transfer_n ) +
    Execution cycles   [cycles @ 400 MHz]

[0420] This double sum converges to the previous formula for any non-trivial number of IRAM
preloads. Also, the RAM cycles dominate the transfer cycles by an order of magnitude.
Therefore this more complicated computation method is only used for the trivial cases.

[0421] For the average case, only data that are read for the first time are accounted for. The
average case is defined as the iteration after an infinite number of iterations: all data that
can be reused from the previous iteration are in the cache. All data that are used for the
first time must be fetched from RAM, and all data that are defined but are not redefined by
the next iteration have to be written back to the cache and the RAM.

[0422] The use of the XppPreloadClean instruction is a special case: no write allocation takes
place, except at the start and the end of the array if it is not aligned to a cache line
boundary. These burst transfers are neglected. Also, no read transfer from the cache to the
IRAM takes place.

5.3.2 Evaluation Procedure 

[0423] As mentioned above, all examples are transformed with various transformations and
intermediate results are presented in C code on a regular basis. Wherever possible, valid C
code is presented. Nevertheless, in some examples it is necessary to use features which are
not expressible in the source language. These then appear in comments within the source code.

[0424] After the partition step, configurations are hand-written in NML to simulate the
compiler code generation step. Placement and routing are done automatically by the mapping
tool XMAP. For convenience, the NML feature to define modules is used. In some cases, the
objects in the critical path are placed relative to each other, as this has proven to improve
the execution performance drastically.


[0425] Each example lists the estimated data transfer performance in a table like the one
below. The estimation assumes a cache controller which works with the RISC frequency, which is
twice the frequency of the XPP array and four times the frequency of the 32-bit main memory
bus. The cache-IRAM transfers are executed with full cache controller speed over a 128-bit
bus. All values are scaled to the cache controller frequency. The table below shows a typical
data transfer estimation.

Data         Size [bytes]   Cache Misses   RAM - Cache [cache cycles]   Cache - IRAM [cache cycles]
Preloads
array1       256            8 (a)          448 (b)                      16 (c)
scalar2      4              1              56                           1
Sum                                        504                          17
Writebacks
output1      256            8              704 (d)                      16 (c)
Sum                                        704                          16

(a) Every 32 bytes one cache miss.
(b) 4*14 cache cycles penalty for cache read miss.
(c) 16 bytes per cycle.
(d) 4*14 cycles penalty for cache write miss (write allocation) + size*4/4 transfer cycles.

[0426] A cache read miss causes a 14-cycle penalty for the burst transfer on the main memory
bus, which amounts to 4*14 = 56 cache cycles to load a 32-byte cache line from main memory.
If a write miss occurs, the cache controller write allocation must first load the affected
cache line before it can be altered and written back. By using XppPreloadClean, write misses
can be avoided. Then only the cache-RAM transfer with a 32-bit word every 4 cache cycles must
be accounted for. For this reason, some examples show a smaller number of write-back cache
misses than expected.
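
The rows of the data transfer table can be recomputed from the data size in bytes. The sketch below uses a hypothetical struct and hypothetical function names, and assumes one miss per started 32-byte cache line, a penalty of 4*14 cache cycles per miss, and 16 bytes per cycle on the cache-IRAM bus, as described above.

```c
#include <assert.h>

enum { CACHE_LINE_BYTES = 32, MISS_PENALTY = 4 * 14, IRAM_BUS_BYTES = 16 };

struct transfer_estimate {
    long cache_misses;       /* one per started cache line          */
    long ram_cache_cycles;   /* RAM <-> cache, in cache cycles      */
    long cache_iram_cycles;  /* cache <-> IRAM over the 128-bit bus */
};

static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/* Preload of `bytes` bytes that are not yet in the cache. */
struct transfer_estimate estimate_preload(long bytes)
{
    assert(bytes > 0);
    struct transfer_estimate e;
    e.cache_misses = ceil_div(bytes, CACHE_LINE_BYTES);
    e.ram_cache_cycles = e.cache_misses * MISS_PENALTY;
    e.cache_iram_cycles = ceil_div(bytes, IRAM_BUS_BYTES);
    return e;
}

/* Write-back with write allocation: line load penalty plus one 32-bit
   word every 4 cache cycles (size * 4 / 4 cycles). */
struct transfer_estimate estimate_writeback(long bytes)
{
    struct transfer_estimate e = estimate_preload(bytes);
    e.ram_cache_cycles += bytes * 4 / 4;
    return e;
}
```

For a 256-byte array this gives 8 misses, 448 RAM-cache cycles, and 16 cache-IRAM cycles for the preload, and 704 RAM-cache cycles for the write-back.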

[0427] The XPP execute cycles are calculated by taking the double cycle difference (scaling to
cache frequency) between the end of the configuration execution and the start of the
configuration execution. The NML sources are implemented so that configuration loading and
configuration execution do not overlap. This is done by means of a start object which is
configured last and creates an event to start execution. The cycle measurements for the XPP
only include the code which is executed in the configurations, i.e., in the loops of the
evaluated function. The remaining control code, i.e., if statements, is not included. It is
possible to neglect this remaining code on the RISC processor, since this code is executed in
parallel to the XPP and is significantly shorter.

[0428] On the reference system, this code is executed in sequence to the code of the
configurations, so it cannot be neglected. Moreover, splitting the code for the reference
system into many small units prevents many optimizations for that system, making the
measurements unrealistic. Thus the complete loop is timed on the reference system for those
case studies that suffer most from these effects.

[0429] The performance data of the reference system were measured by using a production
compiler for a 32-bit fixed-point DSP with a maximum instruction issue of four, an average
instruction issue of approximately two, and a one-cycle memory access to on-chip high-speed
RAM. This allows the data cache miss cycles simply to be added to the measured execution time
to obtain realistic execution times for a memory hierarchy and off-chip RAM. Since the DSP
cannot handle 8-bit data types reasonably, the sources were adapted to work with short, int
and long types only to get representative results.

[0430] The results are summarized in another table. An example is shown below. All values are
converted to the highest frequency (cache / RISC cycles). For each configuration the data
access cycles and the instruction access cycles are listed for RAM and cache accesses. Then
the execution cycles are given for both the XPP and the reference system. Finally the speedup
is presented as reference execution cycles / XPP execution cycles. Using the formulas of
section 5.3.1 provided above, execution cycles and speedup are given for all three different
possibilities where the data can be located initially: in-IRAM (column Core; this applies to
the XPP only, and for the RISC the in-cache column is used instead), in-cache, or in-RAM.

[0431] In the example performance evaluation table below, the first three rows list the
performance data of each configuration separately, and the last row lists the performance data
of all configurations of the function. The data transfer cycles for the separate
configurations, Data Access, represent all preloads and write-backs which would be necessary
for executing the configuration alone. The data transfer cycles for executing all
configurations are fewer than the sum of the cycles for the separate configurations, because
data can remain in the IRAMs or in the cache between two configurations and do not need to be
loaded again.

[0432] Usually the configurations are executed in a loop. Therefore the first table describes
the first iteration of the example loop. Neither the configurations nor the required input
data are in the cache. No outputs have been computed so far, so no write-backs take place.

                  Data Access     Configuration    XPP Execute            Ref. System     Speedup
Configuration     RAM    DCache   RAM     ICache   Core   Cache   RAM     Cache   RAM     Core   Cache   RAM
configuration 1   828    36       9688    1377     366    1377    10516   3624    4452    9.9    2.6     0.4
configuration 2   536    17       3024    429      56     429     3560    256     792     4.6    0.6     0.2
configuration 3   427    16       1736    245      76     245     2163    192     619     2.5    0.8     0.3
all cfgs          1218   37       14392   2051     498    2051    15610   4072    5290    8.2    2.0     0.3

[0433] In the second table, the average case is described: all configurations are cached in
the XPP array, as are the input data arrays that can be reused from the previous iteration.
Therefore the table is missing all instruction transfer cycles.

                  Data Access     XPP Execute           Ref. System     Speedup
Configuration     RAM    DCache   Core   Cache   RAM    Cache   RAM     Core   Cache   RAM
configuration 1   1352   52       366    366     1352   3624    4976    9.9    9.9     3.7
configuration 2   536    17       56     56      536    256     792     4.6    4.6     1.5
configuration 3   760    32       76     76      760    192     952     2.5    2.5     1.3
all cfgs          1440   53       498    498     1440   4072    5512    8.2    8.2     3.8
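
The speedup columns follow from the definition given above, reference execution cycles divided by XPP execution cycles. A minimal sketch with a hypothetical function name, which returns the ratio in tenths rounded to the nearest value so that it can be compared with the one-decimal figures in the tables:

```c
#include <assert.h>

/* Speedup = reference cycles / XPP cycles, returned in tenths and
   rounded to nearest (e.g. 99 stands for 9.9). */
long speedup_tenths(long ref_cycles, long xpp_cycles)
{
    assert(ref_cycles >= 0 && xpp_cycles > 0);
    return (ref_cycles * 10 + xpp_cycles / 2) / xpp_cycles;
}
```

For example, 3624 reference cycles against 366 XPP cycles give 99, that is, a speedup of 9.9.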

[0434] This is repeated for all loops in the example. For some examples, no outer loop exists.
In this case, the sub-optimal linear case is described as well as the case that the given
function is called within a typical loop.

5.4 3x3 Edge Detector

5.4.1 Original Code

[0435] The following is source code:

    #include <stdio.h>

    #define VERLEN 16
    #define HORLEN 16

    main() {
        int v, h, inp;
        int p1[VERLEN][HORLEN];
        int p2[VERLEN][HORLEN];
        int htmp, vtmp, sum;

        for(v=0; v<VERLEN; v++)                 // loop nest 1
            for(h=0; h<HORLEN; h++) {
                scanf("%d", &p1[v][h]);         // read input pixels to p1
                p2[v][h] = 0;                   // initialize p2
            }

        for(v=0; v<=VERLEN-3; v++) {            // loop nest 2
            for(h=0; h<=HORLEN-3; h++) {
                htmp = (p1[v+2][h] - p1[v][h]) +
                       (p1[v+2][h+2] - p1[v][h+2]) +
                       2 * (p1[v+2][h+1] - p1[v][h+1]);
                if (htmp < 0)
                    htmp = -htmp;
                vtmp = (p1[v][h+2] - p1[v][h]) +
                       (p1[v+2][h+2] - p1[v+2][h]) +
                       2 * (p1[v+1][h+2] - p1[v+1][h]);
                if (vtmp < 0)
                    vtmp = -vtmp;
                sum = htmp + vtmp;
                if (sum > 255)
                    sum = 255;
                p2[v+1][h+1] = sum;
            }
        }

        for(v=0; v<VERLEN; v++)                 // loop nest 3
            for(h=0; h<HORLEN; h++)
                printf("%d\n", p2[v][h]);       // print output pixels from p2
    }



5.4.2 Preliminary Transformations

[0436] Due to the calls to the library functions scanf and printf in loop nest one and loop
nest three, respectively, only loop nest two is handled in the further sections.

Basic Transformations

[0438] The following transformations are done:

* [0439] Idiom recognition finds the abs() and min() patterns and reduces them to compiler
known functions.

* [0440] Tree balancing reduces the tree depth by swapping the operands of the additions.

* [0441] The array accesses are mapped to IRAM accesses.

* [0442] Since this example uses different values of one IRAM within an iteration, either
shift register synthesis or data duplication must be used. To show the difference between
these two transformations, both are outlined here.

[0443] The resulting code after this step is shown below. First with shift register synthesis:

    for(v=0; v<=VERLEN-3; v++) {
        int iram0[128];    // p1[v]
        int iram1[128];    // p1[v+1]
        int iram2[128];    // p1[v+2]
        int iram3[128];    // p2[v+1][1]

        for(h=0; h<=HORLEN-1; h++) {
            // fill shift registers
            if (h>1) { tmp00 = tmp01; tmp10 = tmp11; tmp20 = tmp21; }
            if (h>0) { tmp01 = tmp02; tmp11 = tmp12; tmp21 = tmp22; }
            tmp02 = iram0[h]; tmp12 = iram1[h]; tmp22 = iram2[h];
            if (h>1) {     // all shift registers are filled from h == 2 on
                htmp = 2 * (tmp21 - tmp01) +
                       (tmp20 - tmp00) +
                       (tmp22 - tmp02);
                htmp = abs(htmp);
                vtmp = 2 * (tmp12 - tmp10) +
                       (tmp02 - tmp00) +
                       (tmp22 - tmp20);
                vtmp = abs(vtmp);
                sum = min(255, htmp + vtmp);
                iram3[h-1] = sum;
            }
        }
    }

[0444] And with data duplication:

    for(v=0; v<=VERLEN-3; v++) {
        int iram0[128], iram1[128], iram2[128];    // p1[v]
        int iram3[128], iram4[128];                // p1[v+1]
        int iram5[128], iram6[128], iram7[128];    // p1[v+2]
        int iram8[128];                            // p2[v+1][1]

        for(h=0; h<=HORLEN-3; h++) {
            tmp00 = iram0[h];   tmp10 = iram3[h];   tmp20 = iram5[h];
            tmp01 = iram1[h+1];                     tmp21 = iram6[h+1];
            tmp02 = iram2[h+2]; tmp12 = iram4[h+2]; tmp22 = iram7[h+2];
            htmp = 2 * (tmp21 - tmp01) +
                   (tmp20 - tmp00) +
                   (tmp22 - tmp02);
            htmp = abs(htmp);
            vtmp = 2 * (tmp12 - tmp10) +
                   (tmp02 - tmp00) +
                   (tmp22 - tmp20);
            vtmp = abs(vtmp);
            sum = min(255, htmp + vtmp);
            iram8[h+1] = sum;    // result goes to the output IRAM
        }
    }
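
That a shift register formulation computes the same values as the original loop nest two can be checked on a single output row. The following sketch is an illustration under assumptions: the function names are hypothetical, three plain arrays stand in for the IRAMs holding consecutive input rows, and out is indexed by the column of the center pixel.

```c
#include <assert.h>
#include <stdlib.h>

static int min_int(int a, int b) { return a < b ? a : b; }

/* Direct form for one output row: r0, r1, r2 are three consecutive
   input rows of length n; out[h+1] is the result for the 3x3 window
   whose leftmost column is h. */
void edge_row_direct(const int *r0, const int *r1, const int *r2,
                     int *out, int n)
{
    assert(n >= 3);
    for (int h = 0; h <= n - 3; h++) {
        int htmp = abs((r2[h] - r0[h]) + (r2[h + 2] - r0[h + 2]) +
                       2 * (r2[h + 1] - r0[h + 1]));
        int vtmp = abs((r0[h + 2] - r0[h]) + (r2[h + 2] - r2[h]) +
                       2 * (r1[h + 2] - r1[h]));
        out[h + 1] = min_int(255, htmp + vtmp);
    }
}

/* Shift register form: each row is read once per column. In t<row><col>,
   col 2 holds the newest column and col 0 the oldest. */
void edge_row_shift(const int *r0, const int *r1, const int *r2,
                    int *out, int n)
{
    assert(n >= 3);
    int t00 = 0, t01 = 0, t02 = 0;
    int t10 = 0, t11 = 0, t12 = 0;
    int t20 = 0, t21 = 0, t22 = 0;
    for (int h = 0; h < n; h++) {
        t00 = t01; t10 = t11; t20 = t21;
        t01 = t02; t11 = t12; t21 = t22;
        t02 = r0[h]; t12 = r1[h]; t22 = r2[h];
        if (h >= 2) {                    /* guard: registers are filled */
            int htmp = abs(2 * (t21 - t01) + (t20 - t00) + (t22 - t02));
            int vtmp = abs(2 * (t12 - t10) + (t02 - t00) + (t22 - t20));
            out[h - 1] = min_int(255, htmp + vtmp);
        }
    }
}
```

Both functions write the same values to out[1..n-2]; the shift register form trades the repeated array reads for register moves and a guard on the first two iterations.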


[0115] The following table shows the estimated utilization and performance values. TABLE 
US 00081 Value (data Parameter Value 

ParameterValue (shift register synthesis Walue (data duplication) Vector length 16 16 
Reused data set size 32 32 3232 I/Q IRAMs 3 1 + 1 0~ 1 81 + 1 0~9 ALU 
1 5 + 10=481+1 0=9 ALU 8 (calc) + 3*2 (compare for 8 (calc) shift 

register synthesis) = 11 BREG 10 (BREGSUB/ 10 (BREGSUB/ BREG ADD) BREG ADD) 
F-R-EQ-3 -148 fcalc)BREG 1 0 (BREG SUB/BREG ADDMO 

(BREG SUB/BREG ADDWREG3 * 2 = 6 (shift register few-synthesis) Dataflow fewDataflow 
graph width 12 12 1212 Dataflow graph height 3 (shift registers) + 8 
20 (calculation^ (calculation) (calculation) Configuration cycles 1 1 + 16~27 8 + 16~21 
16=278+16=24 

[0146] The inner loop calculation dataflow graph is shown in FIG. 15. The inputs are either
connected over the shift register network shown in FIG. 16, or directly to a dedicated IRAM.

5.4.3 Enhancing Parallelism

[0147] The table above shows a utilization of about one fourth of the ALUs. Until now we
neglected that the SUB and ADD operations can be done by BREGs as well. Therefore we try
to maximize utilization.


Unroll-and-Jam 






[0148] Unroll-and-jam is the transformation of choice, because it brings iterations of the outer
loop together in one loop body. As the reused data size increases, the IRAM usage does not
increase proportionally to the unrolling factor.
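As an illustration of why the memory usage grows more slowly than the unrolling factor, the following minimal sketch applies unroll-and-jam by 2 to a simplified loop nest. The three-row window sum, the array sizes, and the function names are assumptions chosen for brevity; they are not the edge detection kernel itself. Two output rows are produced from four input rows instead of six, because the two middle rows are shared.

```c
#include <assert.h>

#define ROWS 16
#define COLS 16

/* Original nest: each output row sums a three-row window of the input. */
void window_sum(int in[ROWS][COLS], int out[ROWS - 2][COLS])
{
    for (int v = 0; v <= ROWS - 3; v++)
        for (int h = 0; h < COLS; h++)
            out[v][h] = in[v][h] + in[v + 1][h] + in[v + 2][h];
}

/* Unroll-and-jam by 2: the outer loop is unrolled and the two copies of the
 * inner loop are fused, so rows v+1 and v+2 are read once for both outputs. */
void window_sum_unroll_jam(int in[ROWS][COLS], int out[ROWS - 2][COLS])
{
    assert((ROWS - 2) % 2 == 0); /* outer trip count must be a multiple of 2 */
    for (int v = 0; v <= ROWS - 3; v += 2)
        for (int h = 0; h < COLS; h++) {
            int shared = in[v + 1][h] + in[v + 2][h];
            out[v][h]     = in[v][h] + shared;
            out[v + 1][h] = shared + in[v + 3][h];
        }
}

/* Returns 1 if the transformed nest matches the original on a test pattern. */
int unroll_jam_matches(void)
{
    static int in[ROWS][COLS];
    static int a[ROWS - 2][COLS], b[ROWS - 2][COLS];
    for (int v = 0; v < ROWS; v++)
        for (int h = 0; h < COLS; h++)
            in[v][h] = (v * 37 + h * 11) % 256;
    window_sum(in, a);
    window_sum_unroll_jam(in, b);
    for (int v = 0; v < ROWS - 2; v++)
        for (int h = 0; h < COLS; h++)
            if (a[v][h] != b[v][h])
                return 0;
    return 1;
}
```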

[0149] The parameters which determine the unrolling factor are the overall loop count of 14, the
IRAM utilization of 4 and 9, respectively, and the PAE counts. The first parameter allows an
unrolling degree for unroll-and-jam of 2 or 7, while the IRAMs restrict it to 7 and 2,
respectively. The PAE usage would allow an unrolling degree equal to 4 (ALU ADD/SUB
replaced by BREG ADD/SUB). Therefore the minimum of all factors must be taken, which is 2.
The estimated values are shown in the next table. TABLE US 00082

Parameter               Value (shift register synthesis)         Value (data duplication)
Vector length           2 * 16                                   2 * 16
Reused data set size    48                                       48
I/O IRAMs               4 I + 2 O = 6                            12 I + 2 O = 14
ALU                     2 * 8 + 4 * 2 = 24                       2 * 8 = 16
BREG                    20                                       20
FREG                    4 * 2 = 8                                few
Dataflow graph width    12                                       12
Dataflow graph height   3 (shift registers) + 8 (calculation)    8 (calculation)
Configuration cycles    11 + 16 = 27                             8 + 16 = 24
                        (two outputs/configuration)              (two outputs/configuration)

5.4.4 Final Code

Shift Register Synthesis

[0150] The RISC code for shift register synthesis after unroll-and-jam reads then: TABLE US
00083

XppPreloadConfig(_XppCfg_edge3x3);
for(v=0; v<=VERLEN-3; v+=2) {
    XppPreload(0, &p1[v], 16);
    XppPreload(1, &p1[v+1], 16);
    XppPreload(2, &p1[v+2], 16);
    XppPreload(3, &p1[v+3], 16);
    XppPreloadClean(4, &p2[v+1][1], 14);
    XppPreloadClean(5, &p2[v+2][1], 14);
    XppExecute();
}



[0151] The configuration reads as follows: TABLE US 00084

void _XppCfg_edge3x3 {
    // IRAMs
    int iram0[128]; // p1[v]
    int iram1[128]; // p1[v+1]
    int iram2[128]; // p1[v+2]
    int iram3[128]; // p1[v+3]
    int iram4[128]; // p2[v+1][1]
    int iram5[128]; // p2[v+2][1]

    for(h=0; h<=HORLEN-1; h++) {
        // fill shift registers
        if (h>1) { tmp00 = tmp01; tmp10 = tmp11; tmp20 = tmp21;
                   tmp30 = tmp31; }
        if (h>0) { tmp01 = tmp02; tmp11 = tmp12; tmp21 = tmp22;
                   tmp31 = tmp32; }
        tmp02 = iram0[h]; tmp12 = iram1[h]; tmp22 = iram2[h];
        tmp32 = iram3[h];
        if (h>2) {
            htmp0 = 2 * (tmp21 - tmp01) +
                    (tmp20 - tmp00) +
                    (tmp22 - tmp02);
            htmp0 = abs(htmp0);
            vtmp0 = 2 * (tmp12 - tmp10) +
                    (tmp02 - tmp00) +
                    (tmp22 - tmp20);
            vtmp0 = abs(vtmp0);
            sum0 = min(255, htmp0 + vtmp0);
            iram4[h-1] = sum0;

            htmp1 = 2 * (tmp31 - tmp11) +
                    (tmp30 - tmp10) +
                    (tmp32 - tmp12);
            htmp1 = abs(htmp1);
            vtmp1 = 2 * (tmp22 - tmp20) +
                    (tmp12 - tmp10) +
                    (tmp32 - tmp30);
            vtmp1 = abs(vtmp1);
            sum1 = min(255, htmp1 + vtmp1);
            iram5[h-1] = sum1;
        }
    }
}



Data Duplication 



[0152] Data duplication needs more preloads. TABLE US 00085

XppPreloadConfig(_XppCfg_edge3x3);
for(v=0; v<=VERLEN-3; v+=2) {
    XppPreload(0, &p1[v], 16);
    XppPreload(1, &p1[v], 16);
    XppPreload(2, &p1[v], 16);
    XppPreload(3, &p1[v+1], 16);
    XppPreload(4, &p1[v+1], 16);
    XppPreload(5, &p1[v+1], 16);
    XppPreload(6, &p1[v+2], 16);
    XppPreload(7, &p1[v+2], 16);
    XppPreload(8, &p1[v+2], 16);
    XppPreload(9, &p1[v+3], 16);
    XppPreload(10, &p1[v+3], 16);
    XppPreload(11, &p1[v+3], 16);
    XppPreloadClean(12, &p2[v+1][1], 14);
    XppPreloadClean(13, &p2[v+2][1], 14);
    XppExecute();
}



[0153] On the other hand the configuration is less complex. TABLE US 00086

void _XppCfg_edge3x3 {
    // IRAMs
    int iram0[128], iram1[128], iram2[128];   // p1[v]
    int iram3[128], iram4[128], iram5[128];   // p1[v+1]
    int iram6[128], iram7[128], iram8[128];   // p1[v+2]
    int iram9[128], iram10[128], iram11[128]; // p1[v+3]
    int iram12[128];                          // p2[v+1][1]
    int iram13[128];                          // p2[v+2][1]

    for(h=0; h<=HORLEN-3; h++) {
        tmp00 = iram0[h];   tmp10 = iram3[h];
        tmp20 = iram6[h];   tmp30 = iram9[h];
        tmp01 = iram1[h+1]; tmp11 = iram4[h+1];
        tmp21 = iram7[h+1]; tmp31 = iram10[h+1];
        tmp02 = iram2[h+2]; tmp12 = iram5[h+2];
        tmp22 = iram8[h+2]; tmp32 = iram11[h+2];

        htmp0 = 2 * (tmp21 - tmp01) +
                (tmp20 - tmp00) +
                (tmp22 - tmp02);
        htmp0 = abs(htmp0);
        vtmp0 = 2 * (tmp12 - tmp10) +
                (tmp02 - tmp00) +
                (tmp22 - tmp20);
        vtmp0 = abs(vtmp0);
        sum0 = min(255, htmp0 + vtmp0);
        iram12[h] = sum0;

        htmp1 = 2 * (tmp31 - tmp11) +
                (tmp30 - tmp10) +
                (tmp32 - tmp12);
        htmp1 = abs(htmp1);
        vtmp1 = 2 * (tmp22 - tmp20) +
                (tmp12 - tmp10) +
                (tmp32 - tmp30);
        vtmp1 = abs(vtmp1);
        sum1 = min(255, htmp1 + vtmp1);
        iram13[h] = sum1;
    }
}

5.4.5 Performance Evaluation



[0154] The next two tables list the estimated performance of data transfers. The values consider
the data reuse, which means that after the startup, which preloads 4 picture rows, each iteration
only advances two picture rows. Therefore two rows are reused and stay in the cache. TABLE
US 00087

Data                        Size      Cache     RAM to Cache     Cache to IRAM
                            [bytes]   Misses    [cache cycles]   [cache cycles]
Startup Preloads
p1[v]                       64        2         112              4
p1[v+1]                     64        2         112              4
p1[v+2]                     64        2         112              4
p1[v+3]                     64        2         112              4
Sum                                             448              16
Steady State Preloads
p1[v] (reuse p1[v+2])       64        0         -                4
p1[v+1] (reuse p1[v+3])     64        0         -                4
p1[v+2]                     64        2         112              4
p1[v+3]                     64        2         112              4
Sum                                             224              16
Steady State Writebacks
p2[v+1]                     56        2         176              4
p2[v+2]                     56        2         176              4
Sum                                             352              8
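The preload rows of this table follow a simple pattern that can be reproduced with a small cost model. The sketch below is an illustration only: the three constants (32-byte cache lines, 56 cache cycles per miss, 16 bytes per cache cycle between cache and IRAM) are inferred from the table entries and are not stated in the text, and the function and type names are hypothetical. The writeback rows are not modeled.

```c
#include <assert.h>

/* Constants inferred from the preload rows of the table; they are
 * assumptions of this sketch, not values given in the text. */
enum {
    LINE_BYTES           = 32, /* bytes per cache line                 */
    CYCLES_PER_MISS      = 56, /* RAM-to-cache cycles per cache miss   */
    IRAM_BYTES_PER_CYCLE = 16  /* cache-to-IRAM bytes per cache cycle  */
};

typedef struct {
    int misses;        /* cache misses caused by the preload */
    int ram_to_cache;  /* cache cycles spent filling the cache */
    int cache_to_iram; /* cache cycles spent filling the IRAM */
} preload_cost;

/* Estimate the cost of preloading `bytes` bytes into an IRAM.
 * `in_cache` is nonzero when the row is reused from a previous iteration. */
preload_cost estimate_preload(int bytes, int in_cache)
{
    preload_cost c;
    assert(bytes > 0);
    c.misses = in_cache ? 0 : (bytes + LINE_BYTES - 1) / LINE_BYTES;
    c.ram_to_cache = c.misses * CYCLES_PER_MISS;
    c.cache_to_iram = (bytes + IRAM_BYTES_PER_CYCLE - 1) / IRAM_BYTES_PER_CYCLE;
    return c;
}
```

With these assumptions a 64-byte row costs 2 misses, 112 RAM-to-cache cycles and 4 cache-to-IRAM cycles, and a reused row costs only the 4 cache-to-IRAM cycles, which matches the startup and steady state sums of 448 and 224 cycles.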

[0155] For data duplication the following transfer statistics are estimated. The table accounts for
the tripled data transfers between cache and IRAMs. TABLE US 00088

Data                                 Size      Cache     RAM to Cache     Cache to IRAM
                                     [bytes]   Misses    [cache cycles]   [cache cycles]
Startup Preloads
p1[v] (3 times)                      64        2         112              12
p1[v+1] (3 times)                    64        2         112              12
p1[v+2] (3 times)                    64        2         112              12
p1[v+3] (3 times)                    64        2         112              12
Sum                                                      448              48
Steady State Preloads
p1[v] (reuse p1[v+2], 3 times)       64        0         -                12
p1[v+1] (reuse p1[v+3], 3 times)     64        0         -                12
p1[v+2] (3 times)                    64        2         112              12
p1[v+3] (3 times)                    64        2         112              12
Sum                                                      224              48
Steady State Writebacks
p2[v+1]                              56        2         64               4
p2[v+2]                              56        2         64               4
Sum                                                      128              8
[0156] Both configurations, representing the loop, are hand coded in NML and mapped and
simulated with the XDS tools.

[0157] The simulation yields, scaled to the cache frequency, 124 and 144 cycles, respectively.
This is remarkable insofar as we expected the variant with data duplication to produce
better results. It seems that the duplicated IRAMs cause worse routing.

[0158] The performance comparison of the two configurations with the reference system yields
the results in the following table. The first two rows of a section list the startup state and the
steady state of the v-loop. Since the v-loop has a trip count of 7, the sums are calculated as
startup state + 7 * steady state. All values assume worst-case performance, i.e. that configuration
preload cannot be hidden and that no data is in the cache. TABLE US 00089

                            Data Access    Configuration   XPP Execute          Ref. System   Speedup
                            RAM   DCache   RAM    ICache   Core   Cache  RAM    Cache  RAM    Core  Cache  RAM
shift register synthesis
  edge3x3 startup           448   16       2296   1290     0      1290   2744
  edge3x3 steady            352   24       0      0        124    124    352
  sum                       2912                           868    2158   5208   5628   8540   6.5   2.6    1.6
data duplication
  edge3x3 startup           448   48       1848   1049     0      1049   2296
  edge3x3 steady            352   56       0      0        144    144    352
  sum                       2912                           1008   2057   4760   5628   8540   5.6   2.7    1.8
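The sum rows and the speedup columns can be recomputed from the startup and steady state rows. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and the speedup is returned in tenths, rounded to the nearest tenth, on the assumption that the table rounds to one decimal place.

```c
#include <assert.h>

/* Total cycles for one startup pass followed by `trips` steady state passes. */
long total_cycles(long startup, long steady, int trips)
{
    assert(trips >= 0);
    return startup + (long)trips * steady;
}

/* Speedup of the array over the reference system in tenths, rounded to the
 * nearest tenth (for example, 65 stands for 6.5). */
long speedup_tenths(long ref_cycles, long xpp_cycles)
{
    assert(xpp_cycles > 0);
    return (ref_cycles * 10 + xpp_cycles / 2) / xpp_cycles;
}
```

For the shift register synthesis variant, total_cycles(1290, 124, 7) gives the cache-level sum of 2158 cycles, and speedup_tenths(5628, 2158) gives 26, i.e. the factor 2.6 listed in the table.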

[0159] The results show the dominance of the configuration preload. Although the core
performance of the case using data duplication is worse than the case using shift register
synthesis, this is negligible for the values including the memory hierarchy. The next table
assumes that configuration preload can be issued early enough, so it can be hidden and need not
be taken into account. TABLE US 00090

                            Data Access    Configuration   XPP Execute          Ref. System   Speedup
                            RAM   DCache   RAM    ICache   Core   Cache  RAM    Cache  RAM    Core  Cache  RAM
shift register synthesis
  edge3x3 startup           448   16       0      0        0      16     448
  edge3x3 steady            352   24       0      0        124    124    352
  sum                       2912                           868    884    2912   5628   8540   6.5   6.4    2.9
data duplication
  edge3x3 startup           448   48       0      0        0      48     448
  edge3x3 steady            352   56       0      0        144    144    352
  sum                       2912                           1008   1056   2912   5628   8540   5.6   5.3    2.9

[0160] The results again show the impact of the configuration preload for configurations that
calculate small or medium amounts of data. When it can be hidden, performance is almost
doubled in this example.

[0161] The comparison to the reference system shows less improvement compared to other
examples. The reason is the short vector length. Nevertheless, pictures of size 16 x 16 are
not very common, thus we expect better improvements in the next section, which embeds the
algorithm in a parameterized function.

[0162] The final utilization is shown in the next table. As the estimations did not account for
counters and other controlling networks, the values for BREGs and FREGs differ significantly.
TABLE US 00091

Parameter                     Value (shift register synthesis)   Value (data duplication)
Vector length                 2 * 16                             2 * 16
Reused data set size          48                                 48
I/O IRAMs [sum - pct]         6 - 38%                            14 - 88%
ALU [sum - pct]               33 - 52%                           19 - 30%
BREG [def/route/sum - pct]    34/14/58 - 73%                     36/20/56 - 70%
FREG [def/route/sum - pct]    25/27/52 - 65%                     9/38/47 - 59%

5.4.6 Parameterized Function

Source Code

[0163] The benchmark source code is not very likely to be written in that form in real-world
applications. Normally, it would be encapsulated in a function with parameters for input and
output arrays along with the sizes of the picture to work on.

[0164] Therefore the source code would look similar to the following: TABLE US 00092

void edge3x3(int *p1, int *p2, int HORLEN, int VERLEN)
{
    for(v=0; v<=VERLEN-3; v++) {
        for(h=0; h<=HORLEN-3; h++) {
            htmp = (*(p1 + (v+2) * HORLEN + h) - *(p1 + v * HORLEN + h)) +
                   (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + v * HORLEN + h+2)) +
                   2 * (*(p1 + (v+2) * HORLEN + h+1) - *(p1 + v * HORLEN + h+1));
            if (htmp < 0)
                htmp = -htmp;
            vtmp = (*(p1 + v * HORLEN + h+2) - *(p1 + v * HORLEN + h)) +
                   (*(p1 + (v+2) * HORLEN + h+2) - *(p1 + (v+2) * HORLEN + h)) +
                   2 * (*(p1 + (v+1) * HORLEN + h+2) - *(p1 + (v+1) * HORLEN + h));
            if (vtmp < 0)
                vtmp = -vtmp;
            sum = htmp + vtmp;
            if (sum > 255)
                sum = 255;
            *(p2 + (v+1) * HORLEN + h+1) = sum;
        }
    }
}



5.4.7 Transformations

[0165] In addition to the transformations presented in section 5.4.2, this requires some
additional features from the compiler.

[0166] * Loop tiling assures that the IRAM size is not exceeded, and that the cache content is
reused. In this example the algorithm must assure that the tiles overlap. FIG. 17 shows that,
although the tile size must be 128, the loops that advance the tile must have step sizes of 125,
otherwise the grey border edges would not be handled. The final tile size is computed by the
RISC and passed to the array.

[0167] * As the unroll-and-jam algorithm needs iteration counts which are a multiple of 2, a
guarded, peeled-off first iteration is inserted, which calculates the values either on the RISC or
in a configuration of its own.
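The tile bookkeeping described for loop tiling can be sketched as follows. This is a minimal illustration using the tile size of 128 and the step of 125 given in the text; the function names are hypothetical. It counts the tiles visited by a tile-advancing loop and computes the number of elements actually loaded for a tile, which is smaller than 128 only at the picture border.

```c
#include <assert.h>

#define TILE_SIZE 128 /* IRAM capacity in words, from the text          */
#define TILE_STEP 125 /* step of the tile-advancing loops, from the text */

/* Number of tiles visited by: for (s = 0; s <= len - 3; s += TILE_STEP) */
int tile_count(int len)
{
    int n = 0;
    assert(len >= 3);
    for (int s = 0; s <= len - 3; s += TILE_STEP)
        n++;
    return n;
}

/* Number of elements loaded for the tile starting at `start`:
 * min(TILE_SIZE, len - start), as computed on the RISC. */
int tile_extent(int len, int start)
{
    int rest = len - start;
    assert(start >= 0 && start < len);
    return rest < TILE_SIZE ? rest : TILE_SIZE;
}
```

For a picture of 750 by 500 pixels this gives 6 tiles horizontally and 4 vertically, i.e. 24 tiles, and neighboring tiles share the elements between the step and the tile size.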

[0168] The loop nest reads then as follows. We show only the variant with shift register
synthesis, with the loop body omitted for better reading. As stated above, the tile size is 128
(IRAM size), but the tile advancing loops increase by 125, overlapping the tiles correctly. The
loop body equals the one in 5.4.4 (Shift Register Synthesis). TABLE US 00093

for (v=0; v <= VERLEN-3; v+=125)
    for (h=0; h <= HORLEN-3; h+=125)
        for (vv=v; vv < min(v+127, VERLEN-2); vv+=2)
            for (hh=h; hh < min(h+127, HORLEN-2); hh++) {
                ...
            }

5.4.8 Final Code

[0169] In addition to the simple variant, the final tile size of the innermost loop has to be passed
to the array. Therefore the RISC code reads as follows, where the body of the guarded first
iteration for odd tile sizes is omitted for simplicity. TABLE US 00094

XppPreloadConfig(_XppCfg_edge3x3);
for (v=0; v <= VERLEN-3; v+=125)
    for (h=0; h <= HORLEN-3; h+=125) {
        v_tilesize = min(128, VERLEN - v);
        if (v_tilesize & 1 != 0) {
            // calculate line on RISC
            v++; tilesize &= 1;
        }
        for (vv=v; vv < v + v_tilesize; vv+=2) {
            tilesize = min(128, HORLEN-h);
            XppPreload(0, &p1[vv][h], tilesize);
            XppPreload(1, &p1[vv+1][h], tilesize);
            XppPreload(2, &p1[vv+2][h], tilesize);
            XppPreload(3, &p1[vv+3][h], tilesize);
            XppPreloadClean(4, &p2[vv+1][h+1], tilesize - 2);
            XppPreloadClean(5, &p2[vv+2][h+1], tilesize - 2);
            XppPreload(6, &tilesize, 1);
            XppExecute();
        }
    }

[0170] The configuration reads then as follows. TABLE US 00095

void _XppCfg_edge3x3 {
    // IRAMs
    int iram0[128]; // p1[vv]
    int iram1[128]; // p1[vv+1]
    int iram2[128]; // p1[vv+2]
    int iram3[128]; // p1[vv+3]
    int iram4[128]; // p2[vv+1][h+1]
    int iram5[128]; // p2[vv+2][h+1]
    int iram6[128]; // tilesize

    for(h=0; h<=iram6[0]; h++) {
        // fill shift registers
        if (i>1) { tmp00 = tmp01; tmp10 = tmp11; tmp20 = tmp21;
                   tmp30 = tmp31; }
        if (i>0) { tmp01 = tmp02; tmp11 = tmp12; tmp21 = tmp22;
                   tmp31 = tmp32; }
        tmp02 = iram0[h]; tmp12 = iram1[h]; tmp22 = iram2[h];
        tmp32 = iram3[h];
        if (h>2) {
            htmp0 = 2 * (tmp21 - tmp01) +
                    (tmp20 - tmp00) +
                    (tmp22 - tmp02);
            htmp0 = abs(htmp0);
            vtmp0 = 2 * (tmp12 - tmp10) +
                    (tmp02 - tmp00) +
                    (tmp22 - tmp20);
            vtmp0 = abs(vtmp0);
            sum0 = min(255, htmp0 + vtmp0);
            iram4[h-1] = sum0;

            htmp1 = 2 * (tmp31 - tmp11) +
                    (tmp30 - tmp10) +
                    (tmp32 - tmp12);
            htmp1 = abs(htmp1);
            vtmp1 = 2 * (tmp22 - tmp20) +
                    (tmp12 - tmp10) +
                    (tmp32 - tmp30);
            vtmp1 = abs(vtmp1);
            sum1 = min(255, htmp1 + vtmp1);
            iram5[h-1] = sum1;
        }
    }
}

[0171] The estimated utilization and worst-case performance (full tile) is shown below.
TABLE US 00096

Parameter               Value
Vector length           2 * 128
Reused data set size    384
I/O IRAMs               5 I + 2 O = 7
ALU                     2 * 8 + 4 * 2 = 24
BREG                    20
FREG                    4 * 2 = 8
Dataflow graph width    12
Dataflow graph height   3 (shift registers) + 8 (calculation)
Configuration cycles    11 + 128 = 139

5.4.9 Performance Evaluation

[0172] We assume a 750 x 500 pixel picture similar to that shown in FIG. 17. We choose
the size to simplify measurements, since the dimensions are both multiples of 125. The
estimated data transfer performance is shown in the table below.

[0173] When computation of a new tile is begun (startup case), the first four rows must be
loaded from RAM to the cache. During execution of the inner loop (steady state case,
abbreviated steady) only two rows per iteration must be loaded. Since the output IRAMs are
preloaded clean, no write allocation takes place. TABLE US 00097

Data                          Size      Cache     RAM to Cache     Cache to IRAM
                              [bytes]   Misses    [cache cycles]   [cache cycles]
Startup Preloads
p1[vv]                        512       16        896              32
p1[vv+1]                      512       16        896              32
p1[vv+2]                      512       16        896              32
p1[vv+3]                      512       16        896              32
Sum                                               3584             128
Steady State Preloads
p1[vv] (reuse p1[vv+2])       512       0         -                32
p1[vv+1] (reuse p1[vv+3])     512       0         -                32
p1[vv+2]                      512       16        896              32
p1[vv+3]                      512       16        896              32
Sum                                               1792             128
Steady State Writebacks
p2[vv+1]                      504       -         512              32
p2[vv+2]                      504       -         512              32
Sum                                               1024             64

[0174] The simulation yields a cache cycle count of 496 per two rows of a tile. To compare the
values with the reference system we calculate 24 (tiles) * (startup + 63 * steady) for each value.
Since the configuration takes place only once, it is listed in a row of its own in the following
table, and enters the summation without a factor. TABLE US 00098

                   Data Access       Configuration   XPP Execute                Ref. System         Speedup
                   RAM      DCache   RAM    ICache   Core     Cache    RAM      Cache     RAM       Core   Cache  RAM
edge3x3 config                       2464   1408              1408     2464
edge3x3 startup    3548     128                               128      3548
edge3x3 steady     2816     192                      496      496      2816
sum                4342944                           749952   754432   4345408  8577324   12920268  11.4   11.4   3.0
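The sum row of this table can be reproduced from the formula given in the text. The following minimal sketch (the function name is hypothetical) counts the configuration once and multiplies the per-tile cost by the number of tiles.

```c
#include <assert.h>

/* Whole-picture estimate: the configuration is counted once, and each tile
 * contributes one startup pass plus `steady_iters` steady state passes. */
long picture_cycles(long config, long startup, long steady,
                    int tiles, int steady_iters)
{
    assert(tiles > 0 && steady_iters >= 0);
    return config + (long)tiles * (startup + (long)steady_iters * steady);
}
```

With 24 tiles and 63 steady state iterations per tile, picture_cycles(1408, 128, 496, 24, 63) evaluates to 754432, the cache-level sum listed in the table.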
[0175] Finally the overall utilization is shown in the following table. As mentioned above, the
big differences for FREGs and BREGs stem from the missing estimations for counter and
controlling PAEs. TABLE US 00099

Parameter                     Value
Vector length                 2 * 128
Reused data set size          384
I/O IRAMs [sum - pct]         7 - 44%
ALU [sum - pct]               27 - 43%
BREG [def/route/sum - pct]    41/21/62 - 78%
FREG [def/route/sum - pct]    25/34/59 - 74%

5.5 FIR Filter

5.5.1 Original Code
[0176] Source code: TABLE US 00100

#define N 256
#define M 8

int x[N], y[N];
const int c[M] = { 2, 4, 4, 2, 0, 7, -5, 2 };

main() {
    int i, j;

    /* code for loading x */

    for (i = 0; i < N-M+1; i++) {
S:      y[i] = 0;
        for (j = 0; j < M; j++)
S':         y[i] += c[j] * x[i+M-j-1];
    }

    /* code for storing y */
}

[0177] The constants N and M are replaced by their values by the pre-processor. The data
dependency graph is the following: TABLE US 00101

for (i = 0; i < 249; i++) {
S:  y[i] = 0;
    for (j = 0; j < 8; j++)
S':     y[i] += c[j] * x[i+7-j];
}

[0178] The following is a corresponding table: TABLE US 00102

Parameter               Value
Vector length           input: 8, output: 1
Reused data set size    -
I/O IRAMs               3
ALU                     2
BREG                    0
FREG                    0
Dataflow graph width    1
Dataflow graph height   2
Configuration cycles    2 + 8 = 10

[0179] Increasing the amount of parallelism available in a loop implies increasing the amount
of memory needed to carry out the computations of the optimized loop body. In this case, the
maximal parallelism is obtained when all multiplications of the inner loop are done in parallel,
and the inner loop is completely unrolled. This way, 8 elements of array x are needed at each
cycle. This is only possible by using data duplication, which means that all 16 IRAMs (2
IRAMs for each copy of array x) are needed to store array x, and consequently array y has to be
output directly on the output port. Running a configuration that uses only 8 IRAMs for input
twice would be another way to process the 256 values of array x.

[0180] The latter is possible in this case as array y is a global variable, but it would not be
possible if it were a parameter of a function, as is usually the case. Moreover, as the same data
must be loaded from the cache into the different IRAMs for array x, many transfers must be
completed before the configuration can begin the computations. The performance of this algorithm
is bounded by memory access times and thus there is no need to maximize parallelism. For this
reason, the solution chosen by the compiler is to extract less parallelism to release the pressure
on the memory hierarchy. It is presented in the next section. Nevertheless the more parallel
solution is also presented to have a point of comparison.

5.5.2 Solution Chosen by the Compiler

[0181] To find some parallelism in the inner loop, the straightforward solution is to unroll the
inner loop. No other optimization is applied before, as either they do not have an effect on the
loop or they increase the need for IRAMs. After loop unrolling, we obtain the following code:
TABLE US 00103

for (i = 0; i < 249; i++) {
    y[i] = 0;
    y[i] += c[0] * x[i+7];
    y[i] += c[1] * x[i+6];
    y[i] += c[2] * x[i+5];
    y[i] += c[3] * x[i+4];
    y[i] += c[4] * x[i+3];
    y[i] += c[5] * x[i+2];
    y[i] += c[6] * x[i+1];
    y[i] += c[7] * x[i];
}



[0182] Then the parameter table looks like this: TABLE US 00104

Parameter               Value
Vector length           input: 256, output: 249
Reused data set size    -
I/O IRAMs               5
ALU                     16
BREG                    0
FREG                    0
Dataflow graph width    2
Dataflow graph height   9
Configuration cycles    9 + 249 = 258



[0183] Dataflow analysis reveals that y[0]=f(x[0], . . . ,x[7]), y[1]=f(x[1], . . . ,x[8]), . . . ,
y[i]=f(x[i], . . . ,x[i+7]). Successive values of y depend on almost the same successive values of
x. To prevent unnecessary accesses to the IRAMs, the values of x needed for the computation of
the next values of y are kept in registers. In our case this shift register synthesis needs 7
registers. This will be achieved on the PACT XPP by keeping them in FREGs. Then we
obtain the dataflow graph depicted below. An IRAM is used for the input values and an IRAM
for the output values. The first 9 cycles are used to fill the pipeline, and then the throughput is
one output value per cycle. Furthermore, each array will be stored in two IRAMs, which are
linked to each other. The memories will be accessed in FIFO mode. This is referred to as "FIFO
pipelining", and it avoids having to apply loop tiling to fit the amount of memory needed to the
IRAMs when the size of the array is smaller than the total amount of memory available on the
PACT XPP. The code becomes the following after shift register synthesis: TABLE US 00105

c0 = c[0];
c1 = c[1];
c2 = c[2];
c3 = c[3];
c4 = c[4];
c5 = c[5];
c6 = c[6];
c7 = c[7];

r0 = x[0];
r1 = x[1];
r2 = x[2];
r3 = x[3];
r4 = x[4];
r5 = x[5];
r6 = x[6];
r7 = x[7];

for (i = 0; i < 249; i++) {
    y[i] = c7*r0 + c6*r1 + c5*r2 + c4*r3 + c3*r4 + c2*r5 + c1*r6 + c0*r7;
    r0 = r1;
    r1 = r2;
    r2 = r3;
    r3 = r4;
    r4 = r5;
    r5 = r6;
    r6 = r7;
    r7 = x[i+7];
}

[0184] After FIFO pipelining, the code is transformed as shown below, where x1 and x2
represent the parts of x which are loaded into different IRAMs; the same holds for y1 and y2
with respect to array y. TABLE US 00106

int *piram0_1, *piram1_1;

piram0_1 = &x1[0];
piram1_1 = &y1[0];

for (i = 0; i < 249; i++)
{
    r0 = r1;
    r1 = r2;
    r2 = r3;
    r3 = r4;
    r4 = r5;
    r5 = r6;
    r6 = r7;
    r7 = x1++;
    if (i < 128)
        piram0_1++ = x2++;
    else
        if (i == 128)
            x1 = &x1[0];
    y1++ = c7*r0 + c6*r1 + c5*r2 + c4*r3 + c3*r4 + c2*r5 + c1*r6 + c0*r7;
    if (i < 128)
        y2++ = piram1_1++;
    else
        if (i == 128)
            y1 = &y1[0];
}

[0185] The dataflow graph representing the loop body is shown in FIG. 18. 
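The effect of shift register synthesis on the FIR loop can be checked against the original loop nest. The following is a minimal sketch under stated assumptions: it follows the shift-then-read order of the FIFO pipelining listing, preloads seven registers with x[0] to x[6], and uses hypothetical function names. It is an illustration, not the generated configuration.

```c
#include <assert.h>

#define N 256
#define M 8

static const int c[M] = { 2, 4, 4, 2, 0, 7, -5, 2 };

/* Reference: the original loop nest, M reads of x per output value. */
void fir_reference(const int *x, int *y)
{
    assert(x && y);
    for (int i = 0; i < N - M + 1; i++) {
        y[i] = 0;
        for (int j = 0; j < M; j++)
            y[i] += c[j] * x[i + M - j - 1];
    }
}

/* Shift register variant: seven samples are carried between iterations,
 * so only one new element of x is read per output value. */
void fir_shift_register(const int *x, int *y)
{
    int r0;
    int r1 = x[0], r2 = x[1], r3 = x[2], r4 = x[3];
    int r5 = x[4], r6 = x[5], r7 = x[6];
    assert(x && y);
    for (int i = 0; i < N - M + 1; i++) {
        r0 = r1; r1 = r2; r2 = r3; r3 = r4;
        r4 = r5; r5 = r6; r6 = r7;
        r7 = x[i + 7];                 /* the only memory read per iteration */
        y[i] = c[7]*r0 + c[6]*r1 + c[5]*r2 + c[4]*r3
             + c[3]*r4 + c[2]*r5 + c[1]*r6 + c[0]*r7;
    }
}

/* Returns 1 if both variants agree on a deterministic input pattern. */
int fir_variants_agree(void)
{
    static int x[N], ya[N], yb[N];
    for (int i = 0; i < N; i++)
        x[i] = (i * 29 + 3) % 101 - 50;
    fir_reference(x, ya);
    fir_shift_register(x, yb);
    for (int i = 0; i < N - M + 1; i++)
        if (ya[i] != yb[i])
            return 0;
    return 1;
}
```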

[0186] The final parameter table is shown below: TABLE US 00107

Parameter               Value
Vector length           input: 256, output: 249
Reused data set size    -
I/O IRAMs               4
ALU                     15
BREG                    0
FREG                    7
Dataflow graph width    3
Dataflow graph height   9
Configuration cycles    9 + 249 = 258

Variant with Larger Loop Bounds 

[0187] Let us take larger loop bounds and set the values of N and M to 2048 and 64. TABLE
US 00108

for (i = 0; i < 1985; i++) {
    y[i] = 0;
    for (j = 0; j < 64; j++)
        y[i] += c[j] * x[i+63-j];
}

[0188] The loop nest needs 17 IRAMs for the three arrays, which makes it impossible to
execute on the PACT XPP. Following the loop optimization driver given before, we apply loop
tiling to reduce the number of IRAMs needed by the arrays, and the number of resources needed
by the inner loop. We use a size of 512 for x and y, and 16 for c. Theoretically, we could have
taken bigger sizes and occupied more IRAMs, but subsequent optimizations will need more
IRAMs. This can already be stated, as the amount of parallelism in the innermost loop is low,
and increasing it will need more resources; therefore we must take smaller sizes. We
obtain the following loop nest, where only 9 IRAMs are needed for the loop nest at the second
level. TABLE US 00109

for (ii = 0; ii < 4; ii++)
    for (i = 0; i < min(512,1985-ii*512); i++) {
        y[i+512*ii] = 0;
        for (jj = 0; jj < 4; jj++)
            for (j = 0; j < 16; j++)
                y[i+512*ii] += c[16*jj+j] * x[i+512*ii+63-16*jj-j];
    }
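That the tiled loop nest computes the same result as the untiled one can be confirmed with a small test. The sketch below is an illustration only; the function names, the array-as-parameter style and the test pattern are assumptions, while the tile sizes of 512 and 16 are those used in the text.

```c
#include <assert.h>

#define NL 2048
#define ML 64
#define OUT_LEN (NL - ML + 1) /* 1985 output values */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Untiled loop nest with the larger bounds. */
void fir_large(const int *c, const int *x, int *y)
{
    assert(c && x && y);
    for (int i = 0; i < OUT_LEN; i++) {
        y[i] = 0;
        for (int j = 0; j < ML; j++)
            y[i] += c[j] * x[i + 63 - j];
    }
}

/* Tiled loop nest: tiles of 512 for x and y, tiles of 16 for c. */
void fir_large_tiled(const int *c, const int *x, int *y)
{
    assert(c && x && y);
    for (int ii = 0; ii < 4; ii++)
        for (int i = 0; i < min_int(512, OUT_LEN - ii * 512); i++) {
            y[i + 512 * ii] = 0;
            for (int jj = 0; jj < 4; jj++)
                for (int j = 0; j < 16; j++)
                    y[i + 512 * ii] += c[16 * jj + j]
                                     * x[i + 512 * ii + 63 - 16 * jj - j];
        }
}

/* Returns 1 if tiling leaves every output value unchanged. */
int tiling_preserves_result(void)
{
    static int c[ML], x[NL], ya[OUT_LEN], yb[OUT_LEN];
    for (int j = 0; j < ML; j++)
        c[j] = (j % 17) - 8;
    for (int i = 0; i < NL; i++)
        x[i] = (i * 31 + 7) % 100;
    fir_large(c, x, ya);
    fir_large_tiled(c, x, yb);
    for (int i = 0; i < OUT_LEN; i++)
        if (ya[i] != yb[i])
            return 0;
    return 1;
}
```

The last tile of the i loop is shorter than 512 (449 iterations), which is why the bound is written with min.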



[0189] A subsequent application of loop unrolling on the inner loop yields: TABLE US 00110

for (ii = 0; ii < 4; ii++)
    for (i = 0; i < min(512,1985-ii*512); i++) {
        y[i+512*ii] = 0;
        for (jj = 0; jj < 4; jj++) {
            y[i+512*ii] += c[16*jj]    * x[i+512*ii+63-16*jj];
            y[i+512*ii] += c[16*jj+1]  * x[i+512*ii+62-16*jj];
            y[i+512*ii] += c[16*jj+2]  * x[i+512*ii+61-16*jj];
            y[i+512*ii] += c[16*jj+3]  * x[i+512*ii+60-16*jj];
            y[i+512*ii] += c[16*jj+4]  * x[i+512*ii+59-16*jj];
            y[i+512*ii] += c[16*jj+5]  * x[i+512*ii+58-16*jj];
            y[i+512*ii] += c[16*jj+6]  * x[i+512*ii+57-16*jj];
            y[i+512*ii] += c[16*jj+7]  * x[i+512*ii+56-16*jj];
            y[i+512*ii] += c[16*jj+8]  * x[i+512*ii+55-16*jj];
            y[i+512*ii] += c[16*jj+9]  * x[i+512*ii+54-16*jj];
            y[i+512*ii] += c[16*jj+10] * x[i+512*ii+53-16*jj];
            y[i+512*ii] += c[16*jj+11] * x[i+512*ii+52-16*jj];
            y[i+512*ii] += c[16*jj+12] * x[i+512*ii+51-16*jj];
            y[i+512*ii] += c[16*jj+13] * x[i+512*ii+50-16*jj];
            y[i+512*ii] += c[16*jj+14] * x[i+512*ii+49-16*jj];
            y[i+512*ii] += c[16*jj+15] * x[i+512*ii+48-16*jj];
        }
    }

[0190] Finally we obtain the same dataflow graph as above, except that the coefficients must be
read from another IRAM rather than being directly handled like constants by the
multiplications. After shift register synthesis the code is the following: TABLE US 00111

for (ii = 0; ii < 4; ii++)
    for (i = 0; i < min(512,1985-ii*512); i++) {
        r0 = x[i+512*ii+48];
        r1 = x[i+512*ii+49];
        r2 = x[i+512*ii+50];
        r3 = x[i+512*ii+51];
        r4 = x[i+512*ii+52];
        r5 = x[i+512*ii+53];
        r6 = x[i+512*ii+54];
        r7 = x[i+512*ii+55];
        r8 = x[i+512*ii+56];
        r9 = x[i+512*ii+57];
        r10 = x[i+512*ii+58];
        r11 = x[i+512*ii+59];
        r12 = x[i+512*ii+60];
        r13 = x[i+512*ii+61];
        r14 = x[i+512*ii+62];
        r15 = x[i+512*ii+63];
        for (jj = 0; jj < 4; jj++) {
            y[i] = c[8*jj]*r15 + c[8*jj+1]*r14 + c[8*jj+2]*r13 + c[8*jj+3]*r12 +
                   c[8*jj+4]*r11 + c[8*jj+5]*r10 + c[8*jj+6]*r9 + c[8*jj+7]*r8 +
                   c[8*jj+8]*r7 + c[8*jj+9]*r6 + c[8*jj+10]*r5 + c[8*jj+11]*r4 +
                   c[8*jj+12]*r3 + c[8*jj+13]*r2 + c[8*jj+14]*r1 + c[8*jj+15]*r0;
            r0 = r1;
            r1 = r2;
            r2 = r3;
            r3 = r4;
            r4 = r5;
            r5 = r6;
            r6 = r7;
            r7 = r8;
            r8 = r9;
            r9 = r10;
            r10 = r11;
            r11 = r12;
            r12 = r13;
            r13 = r14;
            r14 = r15;
            r15 = x[i+512*ii+63-8*jj];
        }
    }

[0191] The parameter table is then as follows. TABLE US 00112

Parameter               Value
Vector length           input: 8, output: 1
Reused data set size    -
I/O IRAMs               3
ALU                     31
BREG                    0
FREG                    15
Dataflow graph width    3
Dataflow graph height   17
Configuration cycles    4 + 17 = 21

5.5.3 A More Parallel Solution 

[0492] The solution presented above does not expose a lot of parallelism in the loop.
Explicitly parallelizing the loop before generating the dataflow graph can be tried.
Exposing more parallelism may mean more pressure on the memory hierarchy.

[0493] In the data dependence graph presented above, the only loop-carried
dependence is the dependence on S' and it is only caused by the reference to y[i]. Hence,
node splitting is applied to get a more suitable data dependence graph. Accordingly, the
following may be obtained:

for (i = 0; i < 249; i++) {
  y[i] = 0;
  for (j = 0; j < 8; j++)
  {
    tmp = c[j] * x[i+7-j];
    y[i] += tmp;
  }
}

[0494] Then scalar expansion may be performed on tmp to remove the loop-carried
anti-dependence caused by it, and the following code may be obtained:

for (i = 0; i < 249; i++) {
  y[i] = 0;
  for (j = 0; j < 8; j++)
  {
    tmp[j] = c[j] * x[i+7-j];
    y[i] += tmp[j];
  }
}

[0495] The parameter table is the following:

Parameter                  Value
Vector length              input: 8, output: 1
Reused data set size       -
I/O IRAMs                  3
ALU                        2
BREG                       0
FREG                       1
Data flow graph width      2
Data flow graph height     2
Configuration cycles       2 + 8 = 10

[0496] Loop distribution may then be applied to get a vectorizable and a
non-vectorizable loop.

for (i = 0; i < 249; i++) {
  y[i] = 0;
  for (j = 0; j < 8; j++)
    tmp[j] = c[j] * x[i+7-j];
  for (j = 0; j < 8; j++)
    y[i] += tmp[j];
}

[0497] The parameter table given below corresponds to the two inner loops in order
to be compared with the preceding table.

Parameter                  Value
Vector length              input: 8, output: 1
Reused data set size       -
I/O IRAMs                  5
ALU                        2
BREG                       0
FREG                       1
Data flow graph width      1
Data flow graph height     3
Configuration cycles       1*8 + 1*8 = 16

[0498] The architecture may be taken into account. The first loop is fully parallel,
which means that 2*8=16 input values would be needed at a time. This is all right, as it
corresponds to the number of IRAMs of the PACT XPP. Hence, strip-mining the first
inner loop is not required. Strip-mining the second loop is also not required. The
second loop is a reduction. It computes the sum of a vector. This may be easily found by
the reduction recognition optimization and the following code may be obtained.

for (i = 0; i < 249; i++) {
  y[i] = 0;
  for (j = 0; j < 8; j++)
    tmp[j] = c[j] * x[i+7-j];
  /* load the partial sums from memory using a shorter vector length */
  for (j = 0; j < 4; j++)
    aux[j] = tmp[2*j] + tmp[2*j+1];
  /* accumulate the short vector */
  for (j = 0; j < 1; j++)
    aux[2*j] = aux[2*j] + aux[2*j+1];
  /* sequence of scalar instructions to add up the partial sums */
  y[i] = aux[0] + aux[2];
}
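
The reduction step above can be illustrated with a small sketch that compares the direct accumulation with a pairwise (tree) summation of the eight products. The function names fir_direct and fir_tree are hypothetical and not part of the original listing. Note also that, as an assumption, the sketch performs both second-level additions (aux[0] + aux[1] and aux[2] + aux[3]), as in the fully unrolled form of the loop.

```c
#include <stddef.h>

/* Direct form: y = sum over j of c[j] * x[i+7-j] (hypothetical helper). */
int fir_direct(const int *x, const int *c, size_t i)
{
    int y = 0;
    for (size_t j = 0; j < 8; j++)
        y += c[j] * x[i + 7 - j];
    return y;
}

/* Tree form: 8 products, then 4 + 2 + 1 additions (hypothetical helper). */
int fir_tree(const int *x, const int *c, size_t i)
{
    int tmp[8], aux[4];
    for (size_t j = 0; j < 8; j++)
        tmp[j] = c[j] * x[i + 7 - j];
    for (size_t j = 0; j < 4; j++)      /* first level: pairwise sums */
        aux[j] = tmp[2 * j] + tmp[2 * j + 1];
    aux[0] = aux[0] + aux[1];           /* second level */
    aux[2] = aux[2] + aux[3];
    return aux[0] + aux[2];             /* final scalar addition */
}
```

Because integer addition is associative, both forms should give the same result; the tree form merely exposes the additions that can run in parallel.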

[0499] As above, only one table is given below for all innermost loops and the last
instruction computing y[i].

Parameter                  Value
Vector length              input: 256, output: 249
Reused data set size       -
I/O IRAMs                  9
ALU                        4
BREG                       0
FREG                       0
Data flow graph width      1
Data flow graph height     4
Configuration cycles       1*8 + 1*4 + 1*1 = 13

[0500] Finally, loop unrolling may be applied on the inner loops. The number of
operations is always less than the number of processing elements of the PACT XPP.

for (i = 0; i < 249; i++)
{
  tmp[0] = c[0] * x[i+7];
  tmp[1] = c[1] * x[i+6];
  tmp[2] = c[2] * x[i+5];
  tmp[3] = c[3] * x[i+4];
  tmp[4] = c[4] * x[i+3];
  tmp[5] = c[5] * x[i+2];
  tmp[6] = c[6] * x[i+1];
  tmp[7] = c[7] * x[i];
  aux[0] = tmp[0] + tmp[1];
  aux[1] = tmp[2] + tmp[3];
  aux[2] = tmp[4] + tmp[5];
  aux[3] = tmp[6] + tmp[7];
  aux[0] = aux[0] + aux[1];
  aux[2] = aux[2] + aux[3];
  y[i] = aux[0] + aux[2];
}



[0501] The dataflow graph representing the inner loop, as shown in FIG. 19, may be
obtained.

[0502] It could be mapped on the PACT XPP with each layer executed in parallel, thus
requiring 4 cycles/iteration and 15 ALU-PAEs, 8 of which are needed in parallel. As
the graph is already synchronized, the throughput reaches one iteration/cycle after 4 cycles to
fill the pipeline. The coefficients are taken as constant inputs by the ALUs performing the
multiplications.

[0503] A drawback of this solution may be that it uses 16 IRAMs, and that the input data
must be stored in a special order. The number of needed IRAMs can be reduced if the
coefficients are handled like constants for each ALU. But due to data locality of the program,
it can be assumed that the data already reside in the cache. As the transfer of
data from the cache to the IRAMs can be achieved efficiently, the configuration can be
executed on the PACT XPP without waiting for the data to be ready in the IRAMs.
Accordingly, the parameter table may be the following:

Parameter                  Value
Vector length              input: 256, output: 249
Reused data set size       -
I/O IRAMs                  16
ALU                        15
BREG                       0
FREG                       0
Data flow graph width      8
Data flow graph height     4
Configuration cycles       4 + 249 = 253

Variant with Larger Bounds 

[0504] To make things a bit more interesting, in one case, the values of N and M were
set to 2048 and 64.

for (i = 0; i < 1985; i++) {
  y[i] = 0;
  for (j = 0; j < 64; j++)
    y[i] += c[j] * x[i+63-j];
}

[0505] The data dependence graph is the same as above. Node splitting
may then be applied to get a more convenient data dependence graph.

for (i = 0; i < 1985; i++) {
  y[i] = 0;
  for (j = 0; j < 64; j++)
  {
    tmp = c[j] * x[i+63-j];
    y[i] += tmp;
  }
}

[0506] After scalar expansion:

for (i = 0; i < 1985; i++) {
  y[i] = 0;
  for (j = 0; j < 64; j++)
  {
    tmp[j] = c[j] * x[i+63-j];
    y[i] += tmp[j];
  }
}



[0507] After loop distribution:

for (i = 0; i < 1985; i++) {
  y[i] = 0;
  for (j = 0; j < 64; j++)
    tmp[j] = c[j] * x[i+63-j];
  for (j = 0; j < 64; j++)
    y[i] += tmp[j];
}

[0508] After going through the compiling process, the set of optimizations that
depends upon architectural parameters may be arrived at. It might be desired to split the
iteration space, as too many operations would have to be performed in parallel if it is kept
as such. Hence, strip-mining may be performed on the 2 loops. Only 16 data can be
accessed at a time, so, because of the first loop, the factor will be 64 * 2/16 = 8 for the
2 loops (as it is desired to execute both at the same time on the PACT XPP).
for (i = 0; i < 1985; i++) {
  y[i] = 0;
  for (jj = 0; jj < 8; jj++)
    for (j = 0; j < 8; j++)
      tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
  for (jj = 0; jj < 8; jj++)
    for (j = 0; j < 8; j++)
      y[i] += tmp[8*jj+j];
}
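
The strip-mining step above can be checked with a minimal sketch, assuming the strip factor of 8 derived in the text. The function names dot64 and dot64_strips are hypothetical. The sketch only shows that the index 8*jj+j walks through the same 64 taps as the original index j.

```c
/* Original inner loop: 64 taps in one pass (hypothetical helper). */
int dot64(const int *c, const int *x, int i)
{
    int y = 0;
    for (int j = 0; j < 64; j++)
        y += c[j] * x[i + 63 - j];
    return y;
}

/* Strip-mined form: 8 strips (jj) of 8 taps (j) each (hypothetical helper). */
int dot64_strips(const int *c, const int *x, int i)
{
    int y = 0;
    for (int jj = 0; jj < 8; jj++)
        for (int j = 0; j < 8; j++)
            y += c[8 * jj + j] * x[i + 63 - 8 * jj - j];
    return y;
}
```

Each strip touches 8 coefficients and 8 samples, i.e. 16 data items, which matches the access limit stated in the text.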



[0509] Then, loop fusion on the jj loops may be performed.

for (i = 0; i < 1985; i++) {
  y[i] = 0;
  for (jj = 0; jj < 8; jj++) {
    for (j = 0; j < 8; j++)
      tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
    for (j = 0; j < 8; j++)
      y[i] += tmp[8*jj+j];
  }
}



[0510] Reduction recognition may then be applied on the second
innermost loop.

for (i = 0; i < 1985; i++) {
  tmp = 0;
  for (jj = 0; jj < 8; jj++)
  {
    for (j = 0; j < 8; j++)
      tmp[8*jj+j] = c[8*jj+j] * x[i+63-8*jj-j];
    /* load the partial sums from memory using a shorter vector length */
    for (j = 0; j < 4; j++)
      aux[j] = tmp[8*jj+2*j] + tmp[8*jj+2*j+1];
    /* accumulate the short vector */
    for (j = 0; j < 1; j++)
      aux[2*j] = aux[2*j] + aux[2*j+1];
    /* sequence of scalar instructions to add up the partial sums */
    y[i] = aux[0] + aux[2];
  }
}

[0511] Loop unrolling may then be performed:
for (i = 0; i < 1985; i++)
  for (jj = 0; jj < 8; jj++)
  {
    tmp[8*jj]   = c[8*jj]   * x[i+63-8*jj];
    tmp[8*jj+1] = c[8*jj+1] * x[i+62-8*jj];
    tmp[8*jj+2] = c[8*jj+2] * x[i+61-8*jj];
    tmp[8*jj+3] = c[8*jj+3] * x[i+60-8*jj];
    tmp[8*jj+4] = c[8*jj+4] * x[i+59-8*jj];
    tmp[8*jj+5] = c[8*jj+5] * x[i+58-8*jj];
    tmp[8*jj+6] = c[8*jj+6] * x[i+57-8*jj];
    tmp[8*jj+7] = c[8*jj+7] * x[i+56-8*jj];

    aux[0] = tmp[8*jj] + tmp[8*jj+1];
    aux[1] = tmp[8*jj+2] + tmp[8*jj+3];
    aux[2] = tmp[8*jj+4] + tmp[8*jj+5];
    aux[3] = tmp[8*jj+6] + tmp[8*jj+7];

    aux[0] = aux[0] + aux[1];
    aux[2] = aux[2] + aux[3];

    y[i] = aux[0] + aux[2];
  }

[0512] The innermost loop may be implemented on the PACT XPP directly
with a counter. The IRAMs may be used in FIFO mode, and filled according to the
addresses of the arrays in the loop. IRAM0, IRAM2, IRAM4, IRAM6 and IRAM8 contain
array 'c'. IRAM1, IRAM3, IRAM5 and IRAM7 contain array 'x'. Array 'c' contains 64
elements, i.e., each IRAM contains 8 elements. Array 'x' contains 1024 elements,
i.e., 128 elements for each IRAM. Array 'y' is directly written to memory, as it is a global
array and its address is constant. This constant is used to initialize the address counter of the
configuration. A final parameter table is the following:

Parameter                  Value
Vector length              input: 8, output: 1
Reused data set size       -
I/O IRAMs                  16
ALU                        15
BREG                       0
FREG                       0
Data flow graph width      8
Data flow graph height     4
Configuration cycles       4 + 8 = 12
[0513] Nevertheless, it should be noted that this version should be less efficient than the
previous one. As the same data must be loaded in the different IRAMs from the cache,
there are a lot of transfers to be achieved before the configuration can begin the
computations. This overhead must be taken into account by the compiler when choosing the
code generation strategy. This means also that the first solution is the
solution that will be chosen by the compiler.

[0514] 5.5.4 Final Code

int x[256], y[256];
const int c[8] = { 2, 4, 4, 2, 0, 7, -5, 2 };

main( )
{
  XppPreloadConfig(_XppCfg_fir);
  XppPreload(0, x, 128);
  XppPreload(1, x + 128, 128);
  XppExecute( );
  XppSync(y, 249);
}

void _XppCfg_fir( ) {
  // Input IRAMs
  int iram0_1[128], iram0_2[128];
  // Output IRAMs
  int iram1_1[128], iram1_2[128];
  int *piram0_1, *piram1_1;
  piram0_1 = &iram0_1[0];
  piram1_1 = &iram1_1[0];
  for (i = 0; i < 249; i++)
  {
    r0 = r1;
    r1 = r2;
    r2 = r3;
    r3 = r4;
    r4 = r5;
    r5 = r6;
    r6 = r7;
    r7 = iram0_1++;

    if (i < 128)
      piram0_1++ = iram0_2++;
    else
      if (i == 128)
        iram0_1 = &iram0_1[0];

    iram1_1++ = c7*r0 + c6*r1 + c5*r2 + c4*r3 + c3*r4 + c2*r5 + c1*r6 +
                c0*r7;
    if (i < 128)
      iram1_2++ = piram1_1++;
    else
      if (i == 128)
        iram1_1 = &iram1_1[0];
  }
}
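
The shift-register form used by the configuration above can be compared against the direct loop nest with a short, self-contained sketch. This is only an illustration in plain C under stated assumptions: the names fir_direct8 and fir_shift8 are hypothetical, the register file r0..r7 is modelled as an array, and the IRAM pointer handling of the listing is replaced by ordinary array indexing.

```c
/* Direct 8-tap FIR, as in the original loop nest (hypothetical helper). */
void fir_direct8(const int *x, int *y, const int *c, int n)
{
    for (int i = 0; i < n; i++) {
        y[i] = 0;
        for (int j = 0; j < 8; j++)
            y[i] += c[j] * x[i + 7 - j];
    }
}

/* Shift-register form: after the shift, r[k] holds x[i+k], so only one
 * new sample is read per output value (hypothetical helper). */
void fir_shift8(const int *x, int *y, const int *c, int n)
{
    int r[8] = { 0 };
    for (int k = 1; k < 8; k++)        /* prime r[1..7] with x[0..6] */
        r[k] = x[k - 1];
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < 7; k++)    /* r0 = r1; ... r6 = r7 */
            r[k] = r[k + 1];
        r[7] = x[i + 7];               /* newest sample */
        y[i] = 0;
        for (int j = 0; j < 8; j++)    /* c7*r0 + ... + c0*r7 */
            y[i] += c[j] * r[7 - j];
    }
}
```

With n = 249 and 256 input samples, the highest index read is x[255], which matches the array sizes in the listing.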

30 

5.5.5 Performance Evaluation 

[0515] The table below contains data about loading input data from memory, and writing output
data to memory for the FIR example. The cache is supposed to be empty before execution. The
write-back of array y causes no cache miss, because it is only an output data.

Data         Size [bytes]   Cache Misses   RAM to Cache [cache cycles]   Cache to IRAM [cache cycles]
Preloads
x            512            16             896                           32
x + 128      512            16             896                           32
Sum                                        1792                          64
Writebacks
y            996            0              1024                          63
Sum                                        1024                          63

[0516] In the performance evaluation, the XPP performance is compared to a reference system.
The performance data of the reference system was calculated by using a production compiler
for a dual issue 32 bit fixed point DSP. As the RAM to Cache transfer penalty is the same for
the XPP and reference system, it can be neglected for the comparison. It is assumed that the
DSP can perform a load and memory store in one cycle.

[0517] The base for the comparison is the hand-written NML source code fir_simple.nml which
implements the configuration _XppCfg_fir. The final performance evaluation table below lists
the performance data for the configuration. The transfer cycles for the configuration contain
preloads and write-backs necessary for executing the configuration in the steady state case, but
not in the startup case where only the preloads are accounted for.

20 

[0518] The XPP execute cycles are calculated by taking the double cycle difference between the
end of the configuration execution and the start of the configuration execution. The NML
sources were implemented so that configuration loading and
configuration execution do not overlap.

                 Data Access     Configuration   XPP Execute           Ref. System     Speedup
Configuration    RAM    DCache   RAM    ICache   Core   Cache   RAM    Cache   RAM     Core   Cache   RAM
startup case     1792   64       2464   348      648    648     4968   17963   19755   27.7   27.7    4.0
steady state     2816   127      -      -        648    648     2816   17963   20779   27.7   27.7    7.4

[0519] The final utilization of the resources is shown in the following table. The information is
taken from the ".info" files generated from the NML source code by the XMAP tool. The
difference concerning the number of ALUs between this table and the final parameter table
presented before resides in the fact that additions can be executed either by ALUs or BREGs. In
the former parameter table, the additions were meant to be executed by ALUs, whereas in the
NML code, these are mainly performed by BREGs.

Parameter                      Value
Vector length                  read: 256, write: 249
Reused data set size           -
I/O IRAMs [sum-pct]            4 - 25%
ALU [sum-pct]                  10 - 16%
BREG [def/route/sum-pct]       15/2/17 - 21%
FREG [def/route/sum-pct]       16/3/19 - 24%

[0520] Usually the function computing FIR is called in a loop. FIG. 20 sketches how
different iterations can overlap. First the configuration itself is loaded, Ld Config, then the data
needed for the first iteration, Ld Iteration 1. The configuration is then executed, Ex Iteration 1,
and the write-back phase, WB Iteration 1, takes place. The steady state is contained in the
orange box. It is the kernel of the loop, and contains phases of four different iterations. After the
kernel has been executed (n-3) times, n being the number of iterations of the loop, the
remaining phases are executed.
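
The (n-3) kernel count can be illustrated with a minimal scheduling sketch. It assumes, for illustration only, that each iteration passes through a fixed number of phases (four here) and that iteration k runs its phase s in step k + s; the function name full_kernel_steps is hypothetical and FIG. 20 may divide the phases differently.

```c
/* Count the steps in which all `phases` stages are busy, when iteration k
 * (0-based) executes its stage s in step k + s (assumed schedule). */
int full_kernel_steps(int n, int phases)
{
    int full = 0;
    int total_steps = n + phases - 1;
    for (int t = 0; t < total_steps; t++) {
        int busy = 0;
        for (int s = 0; s < phases; s++) {
            int k = t - s;             /* iteration occupying stage s */
            if (k >= 0 && k < n)
                busy++;
        }
        if (busy == phases)            /* a full kernel step */
            full++;
    }
    return full;
}
```

Under these assumptions, full_kernel_steps(n, 4) evaluates to n - 3 for n of at least 4; the remaining steps correspond to filling and draining the pipeline.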

20 

5.5.6 Other Variant 

[0521] Source Code

for (i = 0; i < N-M+1; i++) {
  tmp = 0;
  for (j = 0; j < M; j++)
    tmp += c[j] * x[i+M-j-1];
  x[i] = tmp;
}

[0522] In this case, it is trivial that the data dependence graph is cyclic due to dependences on
tmp. Therefore, scalar expansion is applied on the loop, and, in fact, the same code
as the first version of the FIR filter is obtained as shown below.

for (i = 0; i < N-M+1; i++) {
  tmp[i] = 0;
  for (j = 0; j < M; j++)
    tmp[i] += c[j] * x[i+M-j-1];
  x[i] = tmp[i];
}



5.6 Matrix Multiplication

5.6.1 Original Code

[0523] Source code:

#define L 10
#define M 15
#define N 20

int A[L][M];
int B[M][N];
int R[L][N];

main( ) {
  int i, j, k, tmp, aux;

  /* input A (L*M values) */
  for (i=0; i<L; i++)
    for (j=0; j<M; j++)
      scanf("%d", &A[i][j]);

  /* input B (M*N values) */
  for (i=0; i<M; i++)
    for (j=0; j<N; j++)
      scanf("%d", &B[i][j]);

  /* multiply */
  for (i=0; i<L; i++)
    for (j=0; j<N; j++) {
      aux = 0;
      for (k=0; k<M; k++)
        aux += A[i][k] * B[k][j];
      R[i][j] = aux;
    }

  /* write data stream */
  for (i=0; i<L; i++)
    for (j=0; j<N; j++)
      printf("%d\n", R[i][j]);
}



5.6.2 Preliminary Transformations 

[0524] Since no inline-able function calls are present, no
interprocedural code movement is done.

[0525] Of the four loop nests, the third one with the "/* multiply */" comment is the only
candidate for running partly on the XPP. All others have function calls in the loop body and are
therefore discarded as candidates very early in the compiler.



Dependency Analysis

      for (i=0; i<L; i++)
        for (j=0; j<N; j++) {
S1        aux = 0;
          for (k=0; k<M; k++)
S2          aux += A[i][k] * B[k][j];
S3        R[i][j] = aux;
        }



[0526] The data dependency graph shows no dependencies that
prevent pipeline vectorization. The loop-carried true dependence from S2 to itself can
be handled by a feedback of aux as described in Markus Weinhardt et al., "Memory Access
Optimization for Reconfigurable Systems," supra.



Reverse Loop-Invariant Code Motion

[0527] To get a perfect loop nest, S1 and S3 may be moved inside the k-loop.
Therefore, appropriate guards may be generated to protect the assignments. The code after
this transformation is as follows:

for (i=0; i<L; i++)
  for (j=0; j<N; j++)
    for (k=0; k<M; k++) {
      if (k == 0) aux[j] = 0;
      aux[j] += A[i][k] * B[k][j];
      if (k == M-1) R[i][j] = aux[j];
    }

Scalar Expansion 

[0528] A goal may be to interchange the loop nests to improve the array accesses to utilize
the cache best. However, the guarded statements involving 'aux' may cause
backward loop-carried anti-dependencies carried by the j-loop. Scalar expansion
may break these dependencies, allowing loop interchange.

for (i=0; i<L; i++)
  for (j=0; j<N; j++)
    for (k=0; k<M; k++) {
      if (k == 0) aux[j] = 0;
      aux[j] += A[i][k] * B[k][j];
      if (k == M-1) R[i][j] = aux[j];
    }

5.6.3 Loop Interchange for Cache Reuse

[0529] Visualizing the main loop shows the iteration spaces for the array accesses.
FIG. 21 is a visualization of array access sequences. Since C arrays are placed
in row major order, the cache lines are placed in the array rows. At first sight, there seems to be
no need for optimization because the algorithm requires at least one array access to stride over a
column. Nevertheless, this assumption misses the fact that the access rate is of interest, too.
Closer examination shows that array R is accessed in every j iteration, while array B is accessed
every k iteration, always producing a cache miss. ("aux" is not currently discussed since it is
not expected that it would be written to or read from memory, as there are no defs or uses
outside the loop nest.) This leaves a possibility for loop interchange to improve cache access as
proposed by Kennedy and Allen in Markus Weinhardt et al., "Pipeline Vectorization," supra.

[0530] To find the best loop nest, the algorithm may interchange each loop of the nests
into the innermost position and annotate it with the so-called innermost memory cost term.
This cost term is a constant for known loop bounds or a function of the loop bound for
unknown loop bounds. The term may be calculated in three steps.

* First, the cost of each reference in the innermost loop body may be calculated. It is:

  * 1, if the reference does not depend on the loop induction variable of the (current)
    innermost loop;

  * the loop count, if the reference depends on the loop induction variable and strides over a
    non-contiguous area with respect to the cache layout;

  * N*s/b, if the reference depends on the loop induction variable and strides over a
    contiguous dimension. In this case, N is the loop count, s is the step size and b is the
    cache line size, respectively.

  In this case, a "reference" is an access to an array. Since the transformation attempts to
  optimize cache access, it must address references to the same array within small distances
  as one. This may prohibit over-estimation of the actual costs.

* Second, each reference cost may be weighted with a factor for each other loop, which is:

  * 1, if the reference does not depend on the loop index;

  * the loop count, if the reference depends on the loop index.

* Third, the overall loop nest cost may be calculated by summing the costs of all references.

[0539] After invoking this algorithm for each loop level, the one with the lowest cost may be
chosen as the innermost, the one with the next lowest cost as the next outer loop level, and so
on.

Innermost loop   R[i][j]    A[i][k]    B[k][j]    Memory access cost
k                1*L*N      (M/b)*L    M*N        L*N + (M/b)*L + M*N
i                L*N        L*M        1*M*N      L*N + L*M + M*N
j                (N/b)*L    1*L*M      (N/b)*M    (N/b)*(L + M) + L*M

[0540] The preceding table shows the values for the matrix multiplication. Since the j term is
the smallest (assuming b > 1), the j-loop is chosen to be the innermost loop. The next outer
loop then is k, and the outermost is i. Thus, the resulting code after loop interchange may be:
for (i=0; i<L; i++)
  for (k=0; k<M; k++)
    for (j=0; j<N; j++) {
      if (k == 0) aux[j] = 0;
      aux[j] += A[i][k] * B[k][j];
      if (k == M-1) R[i][j] = aux[j];
    }
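
The cost terms of the table above can be evaluated numerically with a short sketch. The function names are hypothetical, and the cache line size b = 8 array elements used in the usage note is an assumed value for illustration; the text itself only requires b > 1.

```c
/* Memory access cost with k innermost: R is invariant, A is contiguous,
 * B strides over a column (b = cache line size in array elements). */
double cost_k_inner(double L, double M, double N, double b)
{
    return L * N + (M / b) * L + M * N;
}

/* Memory access cost with i innermost: R and A stride over a column. */
double cost_i_inner(double L, double M, double N)
{
    return L * N + L * M + M * N;
}

/* Memory access cost with j innermost: R and B are contiguous. */
double cost_j_inner(double L, double M, double N, double b)
{
    return (N / b) * (L + M) + L * M;
}

/* Returns 'i', 'j' or 'k' for the cheapest innermost loop. */
char best_innermost(double L, double M, double N, double b)
{
    double ck = cost_k_inner(L, M, N, b);
    double ci = cost_i_inner(L, M, N);
    double cj = cost_j_inner(L, M, N, b);
    if (cj <= ck && cj <= ci)
        return 'j';
    return (ck <= ci) ? 'k' : 'i';
}
```

For L = 10, M = 15, N = 20 and the assumed b = 8, the three terms evaluate to 518.75 (k), 650 (i) and 212.5 (j), so the j-loop is selected as the innermost loop, in line with the text.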

[0541] FIG. 22 shows the improved iteration spaces. It shows array
access sequences after optimization. The improvement is visible to the naked eye since array B
is now read following the cache lines. This optimization does not optimize primarily for the
XPP, but mainly optimizes the cache-hit rate, thus improving the overall performance.

5.6.4 Enhancing Parallelism

[0542] After improving the cache access behavior, the possibility for reduction recognition has
been destroyed. This is a typical example for transformations where one excludes the other.
Fully unrolling the inner loop is not applicable due to the number of available IRAMs.
Therefore we try to unroll-and-jam the two innermost loops.

30 

Unroll-and-Jam 

[0543] We unroll the outer loop partially with the unrolling degree u. This factor is computed
by the minimum of two calculations.

u_RAM = IRAMs available / IRAMs needed
u_PAE = PAEs available / PAEs needed



[0544] In this example the accesses to A and B depend on k (the loop which will be unrolled).
Therefore they must be considered in the calculation. The accesses to aux and R do not depend
on k. Thus they can be subtracted from the available IRAMs, but do not need to be added to the
denominator. Therefore we calculate u_RAM = 14/2 = 7.

[0545] On the other hand the loop body involves two ALU operations (1 add, 1 mult), which
yields u_PAE = 64/2 = 32. This is a very inaccurate estimation, since it neither estimates the
resources spent by the controlling network, which decreases the unroll factor, nor takes into
account that, e.g., the BREG-PAEs also have an adder, which increases the unrolling degree.
Although it has no influence on this example, the unrolling degree calculation of course has to
account for this in a production compiler.

[0546] The constraint generated by the IRAMs therefore dominates by far as
u = min(7, 32) = 7.

[0547] To keep the complexity of the configuration simple, we choose an unrolling degree

u_final = loop count / ⌈loop count / u⌉ = 5.
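
The unrolling degree calculation above can be sketched as follows. The function names are hypothetical. The rounding in the final step is taken to be a ceiling, which is an assumption, although it is the reading that reproduces the stated result of 5 for a loop count of 15.

```c
/* Smaller of the two resource-limited unrolling degrees. */
int unroll_degree(int irams_avail, int irams_needed,
                  int paes_avail, int paes_needed)
{
    int u_ram = irams_avail / irams_needed;
    int u_pae = paes_avail / paes_needed;
    return u_ram < u_pae ? u_ram : u_pae;
}

/* Adjusted degree: loop_count / ceil(loop_count / u), so that every
 * unrolled pass covers the same number of original iterations. */
int final_unroll_degree(int loop_count, int u)
{
    int passes = (loop_count + u - 1) / u;   /* ceil(loop_count / u) */
    return loop_count / passes;
}
```

With the values from the text, unroll_degree(14, 2, 64, 2) gives 7 and final_unroll_degree(15, 7) gives 5, so the k-loop of 15 iterations is executed in three passes of five.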

25 

[0548] The code after this transformation then reads:

for (i=0; i<L; i++) {
  for (k=0; k<M; k+=5) {
    for (j=0; j<N; j++) {
      if (k == 0) aux[j] = 0;
      aux[j] += A[i][k] * B[k][j];
      aux[j] += A[i][k+1] * B[k+1][j];
      aux[j] += A[i][k+2] * B[k+2][j];
      aux[j] += A[i][k+3] * B[k+3][j];
      aux[j] += A[i][k+4] * B[k+4][j];
      if (k == 10) R[i][j] = aux[j];
    }
  }
}
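
As a plausibility check, the unroll-and-jam form can be compared against the original triple loop in plain C. This is a minimal sketch, not the generated code: the function names matmul_ref and matmul_unrolled are hypothetical, the bounds are the example values L = 10, M = 15, N = 20, and the guard k == 10 is written as k == M - 5.

```c
enum { L = 10, M = 15, N = 20 };

/* Reference: original i-j-k loop nest. */
void matmul_ref(int A[L][M], int B[M][N], int R[L][N])
{
    for (int i = 0; i < L; i++)
        for (int j = 0; j < N; j++) {
            int aux = 0;
            for (int k = 0; k < M; k++)
                aux += A[i][k] * B[k][j];
            R[i][j] = aux;
        }
}

/* Interchanged (i-k-j) loop nest with the k-loop unrolled by 5. */
void matmul_unrolled(int A[L][M], int B[M][N], int R[L][N])
{
    int aux[N];
    for (int i = 0; i < L; i++)
        for (int k = 0; k < M; k += 5)
            for (int j = 0; j < N; j++) {
                if (k == 0) aux[j] = 0;          /* first pass: reset */
                aux[j] += A[i][k]     * B[k][j];
                aux[j] += A[i][k + 1] * B[k + 1][j];
                aux[j] += A[i][k + 2] * B[k + 2][j];
                aux[j] += A[i][k + 3] * B[k + 3][j];
                aux[j] += A[i][k + 4] * B[k + 4][j];
                if (k == M - 5) R[i][j] = aux[j]; /* last pass: store */
            }
}
```

Both functions should fill R with identical values, since the unrolled form only regroups the same 15 products per result element into three passes of five.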



5.6.5 Final Code 

[0549] After allocation of the arrays and scalars to IRAMs the code running on the RISC looks
as follows. The array aux storing the intermediate results is normally preloaded, although its
value is not used in the first iteration of the k-loop. Nevertheless it must be preloaded by the
other iterations, therefore we must issue an XppPreload, not an XppPreloadClean.

XppPreloadConfig(_XppCfg_matmult);
for (i=0; i<L; i++) {
  XppPreload(12, &aux, N);
  XppPreload(0, &A[i][0], M);
  XppPreload(1, &A[i][0], M);
  XppPreload(2, &A[i][0], M);
  XppPreload(3, &A[i][0], M);
  XppPreload(4, &A[i][0], M);
  XppPreloadClean(11, &R[i][0], N);
  for (k=0; k<M; k+=5) {
    XppPreload(5, &k, 1);
    XppPreload(6, &B[k][0], N);
    XppPreload(7, &B[k+1][0], N);
    XppPreload(8, &B[k+2][0], N);
    XppPreload(9, &B[k+3][0], N);
    XppPreload(10, &B[k+4][0], N);
    XppExecute( );
  }
}

[0550] The configuration is shown below.

void _XppCfg_matmult( )
{
  // IRAMs
  // A[i][k]
  int iram0[128], iram1[128], iram2[128], iram3[128], iram4[128];
  // k
  int iram5[128];
  // B[k][j] .. B[k+4][j]
  int iram6[128], iram7[128], iram8[128], iram9[128], iram10[128];
  // R[i][j], aux[j]
  int iram11[128], iram12[128];
  for (j=0; j<N; j++) {
    tmp1 = iram0[iram5[0]] * iram6[j];
    tmp2 = iram1[iram5[0]+1] * iram7[j];
    tmp3 = iram2[iram5[0]+2] * iram8[j];
    tmp4 = iram3[iram5[0]+3] * iram9[j];
    tmp5 = iram4[iram5[0]+4] * iram10[j];
    if (iram5[0] == 0)
      tmp6 = tmp1 + tmp2 + tmp3 + tmp4 + tmp5;
    else
      tmp6 += iram12[j] + tmp1 + tmp2 + tmp3 + tmp4 + tmp5;
    iram12[j] = tmp6;
    if (iram5[0] == 10)
      iram11[j] = tmp6;
  }
}

[0551] The estimated statistics are shown in the table below. Unfortunately the IRAM usage
prevents a better utilization. FIG. 23 shows the dataflow graph of the configuration.

Parameter                  Value
Vector length              20
Reused data set size       -
I/O IRAMs                  11 I + 1 O + 1 I/O = 13
ALU                        10
BREG                       few
FREG                       few
Data flow graph width      14
Data flow graph height     6
Configuration cycles       6 + 20 = 26

5.6.6 Performance Evaluation 

5 

[0552] The next table lists the estimated performance of data transfers.

Data                         Size [bytes]   Cache Misses   RAM to Cache [cache cycles]   Cache to IRAM [cache cycles]   Factor
Preloads/i-loop
A[i][0]                      60             2              112                           4
A[i][0]                      60             0                                            4
A[i][0]                      60             0                                            4
A[i][0]                      60             0                                            4
A[i][0]                      60             0                                            4
Sum                                                        112                           20                             10
aux, stays in cache          80             3              168                           5                              1
Preloads/j-loop
B[k][0]                      80             3              168                           5
B[k+1][0]                    80             3              168                           5
B[k+2][0]                    80             3              168                           5
B[k+3][0]                    80             3              168                           5
B[k+4][0]                    80             3              168                           5
aux, stays in cache          80                                                          5
Sum                                                        840                           30                             3 / 30
Writebacks
aux, stays in cache          80                                                          5                              30
R, written back in i-loop    80                            96                            5                              10

[0553] For the comparison with the reference system, we assume that first the configuration, the
first five A[i][0] values and aux are preloaded (row startup i-loop). In the nine subsequent
iterations of the i-loop, only five A[i][0] values are preloaded (row steady i-loop). All loads
of A[i][0] cause one cache miss and four hits.

[0554] Furthermore, we assume that all values of B are loaded into the cache during execution
of the first iteration of the i-loop. They stay there during the other iterations. Thus cache read
misses due to accesses to B are only taken into account three times (row j-loop i==0). All
subsequent 27*5 accesses to B cause only cache-IRAM transfers (row j-loop i!=0). We assume
that aux stays in its IRAM or is only written back to the cache during the whole execution.
The first assumption presumes that no task switch occurs during calculation of the whole
matrix, which cannot be guaranteed; the second one can safely be assumed. Due to the
dominance of the execution cycles, neither has an impact on the total performance.

[0555] The last but one row, row WB R, shows the write-backs of the result matrix R, which
occur ten times and are also added to the other terms.

[0556] The hand-coded configuration cycles are measured at 55 XPP cycles, or 110 cache
cycles.

                 Data Access    Configuration   XPP Execute          Ref. System    Speedup
configurations   RAM    DCache  RAM    ICache   Core   Cache  RAM    Cache   RAM    Core  Cache  RAM
startup i-loop   280    25      1232   687             687    1512
steady i-loop    112    25                             25     112
j-loop i==0      840    30                      110    110    840
j-loop i!=0             35                      110    110    110
WB R             96     5                              5      96
sum              4768                           3300   4262   8970   26279   31047  8.0   6.2    3.5
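
As a cross-check of the sum row, the following minimal C sketch weights each row of the table by the number of times it occurs and adds the results. The repetition counts (1, 9, 3, 27 and 10) are taken from the assumptions stated in the preceding paragraphs; the struct and function names are hypothetical and only serve this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table: how often it occurs and its XPP Execute cycles. */
struct phase {
    const char *name;
    int count;              /* repetitions, from the stated assumptions */
    int core, cache, ram;   /* XPP Execute columns, in cache cycles */
};

static const struct phase PHASES[] = {
    {"startup i-loop",  1,   0, 687, 1512},
    {"steady i-loop",   9,   0,  25,  112},
    {"j-loop i==0",     3, 110, 110,  840},
    {"j-loop i!=0",    27, 110, 110,  110},
    {"WB R",           10,   0,   5,   96},
};
enum { N_PHASES = sizeof PHASES / sizeof PHASES[0] };

struct totals { long core, cache, ram; };

/* Weighted sum over all rows. */
struct totals xpp_totals(const struct phase *p, size_t n)
{
    struct totals t = {0, 0, 0};
    assert(p != NULL);
    for (size_t i = 0; i < n; i++) {
        t.core  += (long)p[i].count * p[i].core;
        t.cache += (long)p[i].count * p[i].cache;
        t.ram   += (long)p[i].count * p[i].ram;
    }
    return t;
}

/* Speedup rounded to one decimal place, returned in tenths (80 means 8.0). */
long speedup_tenths(long ref_cycles, long xpp_cycles)
{
    assert(xpp_cycles > 0);
    return (ref_cycles * 10 + xpp_cycles / 2) / xpp_cycles;
}
```

Calling xpp_totals(PHASES, N_PHASES) and dividing the reference-system cycles from the sum row by the three totals reproduces the listed speedups after rounding.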

[0557] The final utilization is shown in the next table.

Parameter                     Value
Vector length                 20
Reused data set size          -
I/O IRAMs [sum-pct]           13 - 82%
ALU [sum-pct]                 13 - 20%
BREG [def/route/sum-pct]      10/27/37 - 46%
FREG [def/route/sum-pct]      17/9/28 - 35%

5.7 Viterbi Encoder

5.7.1 Original Code

[0558] Source Code:

/* C-language butterfly */
#define BFLY(i) {\
  unsigned char metric, m0, m1, decision;\
  metric = ((Branchtab29_1[i] ^ sym1) +\
            (Branchtab29_2[i] ^ sym2) + 1)/2;\
  m0 = vp->old_metrics[i] + metric;\
  m1 = vp->old_metrics[i+128] + (15 - metric);\
  decision = (m0-m1) >= 0;\
  vp->new_metrics[2*i] = decision ? m1 : m0;\
  vp->dp->w[i/16] |= decision << ((2*i)&31);\
  m0 -= (metric+metric-15);\
  m1 += (metric+metric-15);\
  decision = (m0-m1) >= 0;\
  vp->new_metrics[2*i+1] = decision ? m1 : m0;\
  vp->dp->w[i/16] |= decision << ((2*i+1)&31);\
}

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
  int i;
  struct v29 *vp = p;
  unsigned char *tmp;
  int normalize = 0;

  for (i=0; i<8; i++)
    vp->dp->w[i] = 0;

  for (i=0; i<128; i++)
    BFLY(i);

  /* Renormalize metrics */
  if (vp->new_metrics[0] > 150) {
    int i;
    unsigned char minmetric = 255;

    for (i=0; i<64; i++)
      if (vp->new_metrics[i] < minmetric)
        minmetric = vp->new_metrics[i];

    for (i=0; i<64; i++)
      vp->new_metrics[i] -= minmetric;
    normalize = minmetric;
  }

  vp->dp++;
  tmp = vp->old_metrics;
  vp->old_metrics = vp->new_metrics;
  vp->new_metrics = tmp;

  return normalize;
}



5.7.2 Interprocedural Optimizations and Scalar Transformations 

[0559] Since no inlineable function calls are present, in an embodiment of the present
invention, no interprocedural code movement is done.

[0560] After expression simplification, strength reduction, SSA renaming, copy coalescing and
idiom recognition, the code may be approximately as presented below (statements are
reordered for convenience). Note that idiom recognition may find the combination of min( )
and use of the comparison result for decision and _decision. However, the resulting
computation cannot be expressed in C, so it is described below as a comment:

int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2) {
  int i;
  struct v29 *vp = p;
  unsigned char *tmp;
  int normalize = 0;

  char *_vpdpw_ = vp->dp->w;
  for (i=0; i<8; i++)
    *_vpdpw_++ = 0;

  char *_bt29_1 = Branchtab29_1;
  char *_bt29_2 = Branchtab29_2;
  char *_vpom0 = vp->old_metrics;
  char *_vpom128 = vp->old_metrics+128;
  char *_vpnm = vp->new_metrics;
  char *_vpdpw = vp->dp->w;

  for (i=0; i<128; i++) {
    unsigned char metric, _tmp, m0, m1, _m0, _m1, decision, _decision;

    metric = ((*_bt29_1++ ^ sym1) +
              (*_bt29_2++ ^ sym2) + 1)/2;
    _tmp = (metric+metric-15);
    m0 = *_vpom0++ + metric;
    m1 = *_vpom128++ + (15 - metric);
    _m0 = m0 - _tmp;
    _m1 = m1 + _tmp;
    // decision = m0 >= m1;
    // _decision = _m0 >= _m1;
    *_vpnm++ = min(m0,m1);   // = decision ? m1 : m0
    *_vpnm++ = min(_m0,_m1); // = _decision ? _m1 : _m0
    _vpdpw[i >> 4] |= ( m0 >=  m1) /* decision*/ << ((2*i) & 31)
                    | (_m0 >= _m1) /*_decision*/ << ((2*i+1)&31);
  }

  /* Renormalize metrics */
  if (vp->new_metrics[0] > 150) {
    int i;
    unsigned char minmetric = 255;

    char *_vpnm = vp->new_metrics;
    for (i=0; i<64; i++)
      minmetric = min(minmetric, *_vpnm++);

    _vpnm = vp->new_metrics;  /* rewind for the second pass */
    for (i=0; i<64; i++)
      *_vpnm++ -= minmetric;
    normalize = minmetric;
  }

  vp->dp++;
  tmp = vp->old_metrics;
  vp->old_metrics = vp->new_metrics;
  vp->new_metrics = tmp;

  return normalize;
}



5.7.3 Initialization and Butterfly Loop 

[0561] The first and second loop, in which the BFLY( ) macro has been expanded, are of
interest for being executed on the XPP array, and need further examination. Below is the
configuration source code of the first two loops:

/** _XppCfg_viterbi29( )
 * Performs viterbi butterfly loop
 * XPPIN:  iram0,2 contains Branchtab29_1 and Branchtab29_2, respectively
 *         iram4,5 contains old_metrics and old_metrics+128, respectively
 *         iram1,3 contains scalars sym1 and sym2, respectively
 * XPPOUT: iram6 contains the new metrics array
 *         iram7 contains the decision array
 */
void _XppCfg_viterbi29( )
{
  // IRAMs in FIFO mode
  //
  char *iram0;  // Branchtab29_1, read access with 32-to-8-bit converter
  char *iram2;  // Branchtab29_2, read access with 32-to-8-bit converter
  char *iram4;  // vp->old_metrics, read access with 32-to-8-bit converter
  char *iram5;  // vp->old_metrics+128, read access with 32-to-8-bit converter
  short *iram6; // vp->new_metrics, write access with 16-to-32-bit converter
  // IRAMs in RAM mode
  //
  int iram1[128]; // sym1, read access
  int iram3[128]; // sym2, read access
  int iram7[128]; // vp->dp->w, write access
  int i;
  unsigned char sym1, sym2;

  sym1 = iram1[0];
  sym2 = iram3[0];

  for (i=0; i<8; i++)
    iram7[i] = 0;

  for (i=0; i<128; i++) {
    unsigned char metric, _tmp, m0, m1, _m0, _m1;

    metric = ((*iram0++ ^ sym1) + (*iram2++ ^ sym2) + 1)/2;
    _tmp = (metric << 1) - 15;
    m0 = *iram4++ + metric;
    m1 = *iram5++ + (15 - metric);
    _m0 = m0 - _tmp;
    _m1 = m1 + _tmp;
    // assuming big endian; little endian has the shift on the latter min( )
    *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
    iram7[i >> 4] |= ( m0 >=  m1) << ((2*i) & 31)
                   | (_m0 >= _m1) << ((2*i+1)&31);
  }
}

[0562] The dataflow graph is shown in FIG. 24 (the 32-to-8-bit converters are not shown). The
solid lines represent flow of data, while the dashed lines represent flow of events.

[0563] The recurrence on the IRAM7 access needs at least 2 cycles, i.e. 2 cycles are needed for
each input value. Therefore a total of 256 cycles are needed for a vector length of 128.

Parameter                  Value
Vector length              read: 32 (= 128 chars), write: 64 (= 256 chars)
Reused data set size       -
I/O IRAMs                  6 I + 2 O
ALU                        26
BREG                       few
FREG                       few
Data flow graph width      4
Data flow graph height     12 + 4 (32-to-8-bit converters)
Configuration cycles       16 + 256

[0564] A problem is then obvious: IRAM7 is fully busy reading and rewriting the same address
16 times. Loop tiling with a tile size of 16 gives redundant load/store elimination a chance to
read the value once and accumulate the bits in a temporary variable, writing the value to the
IRAM at the end of this inner loop. Loop fusion with the initialization loop then allows
propagation of the zero values set in the first loop to the reads of vp->dp->w[i] (IRAM7),
eliminating the first loop altogether. Loop tiling with a tile size of 16 also eliminates the & 31
expressions for the shift values: since the new inner loop only runs from 0 to 16, value range
analysis can compute that the & 31 expression is not limiting the value range anymore.
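
The effect of tiling combined with redundant load/store elimination can be sketched on an ordinary host in C. The sketch below is an illustration only, not the XPP configuration: the decision bits are assumed to be given as two input arrays d0 and d1 (hypothetical names), and the functions compare the original read-modify-write form against the tiled form that accumulates one 32-bit word per tile in a temporary.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define N_BFLY 128   /* butterflies per call */
#define TILE    16   /* butterflies per 32-bit decision word */

/* Original form: w[] is zeroed by a separate loop, then read and
 * rewritten once per butterfly; shift amounts are masked with & 31. */
void decisions_naive(const uint8_t *d0, const uint8_t *d1, uint32_t w[8])
{
    assert(d0 != NULL && d1 != NULL);
    memset(w, 0, 8 * sizeof w[0]);
    for (int i = 0; i < N_BFLY; i++)
        w[i / TILE] |= ((uint32_t)(d0[i] & 1) << ((2 * i) & 31))
                     | ((uint32_t)(d1[i] & 1) << ((2 * i + 1) & 31));
}

/* Tiled form: the bits of one tile are accumulated in a scalar and
 * stored once, so no initialization loop and no masking are needed. */
void decisions_tiled(const uint8_t *d0, const uint8_t *d1, uint32_t w[8])
{
    assert(d0 != NULL && d1 != NULL);
    for (int i = 0; i < N_BFLY / TILE; i++) {
        uint32_t acc = 0;                    /* zero propagated from init loop */
        for (int i2 = 0; i2 < 2 * TILE; i2 += 2) {
            int k = i * TILE + i2 / 2;       /* index of the butterfly */
            acc |= ((uint32_t)(d0[k] & 1) << i2)
                 | ((uint32_t)(d1[k] & 1) << (i2 + 1));
        }
        w[i] = acc;                          /* single store per tile */
    }
}
```

Both functions produce identical decision words for any input; the tiled variant touches each word of w exactly once, which is what removes the per-iteration IRAM7 recurrence described above.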

[0565] All remaining input IRAMs are character (8-bit) based. Therefore 32-to-8-bit
converters are needed to split the 32-bit stream into an 8-bit stream. Unrolling is limited to
unrolling twice due to ALU availability as well as due to the fact that IRAM6 is already 16-bit
based: unrolling once requires a shift by 16 and an or to write 32 bits every cycle; unrolling
further cannot increase pipeline throughput anymore. Hence the body is only unrolled once,
eliminating one layer of merges. This yields two separate pipelines, each handling two 8-bit
slices of the 32-bit value from the IRAM, serialized by merges.

[0566] The resulting configuration source code is listed below, where unrolling has been
omitted for the sake of simplicity:

/** _XppCfg_viterbi29( )
 * Performs viterbi butterfly loop
 * XPPIN:  iram0,2 contains Branchtab29_1 and Branchtab29_2, respectively
 *         iram4,5 contains old_metrics and old_metrics+128, respectively
 *         iram1,3 contains scalars sym1 and sym2, respectively
 * XPPOUT: iram6 contains the new metrics array
 *         iram7 contains the decision array
 */
void _XppCfg_viterbi29( )
{
  // IRAMs in FIFO mode
  //
  char *iram0;  // Branchtab29_1, read access with 32-to-8-bit converter
  char *iram2;  // Branchtab29_2, read access with 32-to-8-bit converter
  char *iram4;  // vp->old_metrics, read access with 32-to-8-bit converter
  char *iram5;  // vp->old_metrics+128, read access with 32-to-8-bit converter
  short *iram6; // vp->new_metrics, write access with 16-to-32-bit converter
  unsigned long *iram7; // vp->dp->w, write access
  // IRAMs in RAM mode
  //
  int iram1[128]; // sym1, read access
  int iram3[128]; // sym2, read access
  int i, i2;
  int rlse;
  unsigned char sym1, sym2;

  sym1 = iram1[0];
  sym2 = iram3[0];

  for (i=0; i<8; i++) {
    rlse = 0;
    for (i2=0; i2<32; i2+=2) { // unrolled once
      unsigned char metric, _tmp, m0, m1, _m0, _m1;

      metric = ((*iram0++ ^ sym1) + (*iram2++ ^ sym2) + 1)/2;
      _tmp = (metric << 1) - 15;
      m0 = *iram4++ + metric;
      m1 = *iram5++ + (15 - metric);
      _m0 = m0 - _tmp;
      _m1 = m1 + _tmp;
      *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
      rlse = rlse | ( m0 >=  m1) << i2
                  | (_m0 >= _m1) << (i2+1);
    }
    *iram7++ = rlse;
  }
}



[0567] FIG. 25 shows the modified data flow graph (unrolling and splitting have been omitted
for simplicity).

[0568] Again, the recurrence with the rlse scalar needs two cycles. With an unrolling factor of
two, 128 cycles are needed for a vector length of 128.

Parameter                  Value
Vector length              32 (read) / 64 (write)
Reused data set size       -
I/O IRAMs                  6 I + 2 O
ALU                        2 * 26 + 2 (join) = 62
BREG                       few
FREG                       few
Data flow graph width      4
Data flow graph height     12 + 4 (32-to-8-bit converters) = 16
Configuration cycles       16 + 128
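
The "Configuration cycles" entries in the parameter tables of this section appear to follow a simple pattern: the data flow graph height (pipeline fill) plus the number of loop iterations times the cycles each iteration needs. The C sketch below encodes that pattern. It is a model inferred from the tabulated numbers rather than a formula stated in the text, and the function name and parameters are hypothetical.

```c
#include <assert.h>

/* height:     data flow graph height, i.e. cycles to fill the pipeline
 * ii:         cycles per loop iteration (2 where a recurrence is present,
 *             1 otherwise)
 * iterations: loop trip count after unrolling */
unsigned config_cycles(unsigned height, unsigned ii, unsigned iterations)
{
    assert(ii >= 1);
    return height + ii * iterations;
}
```

For example, config_cycles(16, 2, 128) gives the 16 + 256 of the butterfly loop without unrolling, and config_cycles(16, 2, 64) gives the 16 + 128 obtained after unrolling once.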
5.7.4 Re-Normalization

[0569] The normalization consists of a loop scanning the input for the minimum and a second
loop that subtracts the minimum from all elements. There is a data dependency
between all iterations of the first loop and all iterations of the second loop. Therefore, the two
loops cannot be merged or pipelined. They may be handled individually.

Minimum Search



[0570] The third loop is a minimum search in an array of bytes. The first version of the
configuration source code is listed below:

/** _XppCfg_calcmin( )
 * Performs a minimum search over a character array
 * XPPIN:  iram6 contains the character input array
 * XPPOUT: iram0 contains the minimum value
 */
void _XppCfg_calcmin( )
{
  // IRAMs in FIFO mode
  //
  unsigned char *iram6; // vp->new_metrics, read access with 32-to-8-bit converter
  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, write access
  int i;
  unsigned char minmetric = 255;

  for (i=0; i<64; i++) {
    minmetric = min(minmetric, *iram6++);
  }
  iram0[0] = minmetric;
}

[0571] As there is a recurrence with minmetric which needs two cycles, a total of 128 cycles are
needed for a vector length of 64.

Parameter                  Value
Vector length              16 (= 64 chars)
Reused data set size       -
I/O IRAMs                  1 + 1
ALU                        2
BREG                       2
FREG                       3
Data flow graph width      1
Data flow graph height     1 + 4 (32-to-8-bit converter)
Configuration cycles       5 + 128

[0572] Reduction recognition may eliminate the dependence for minmetric, enabling a
four-times unroll to utilize the IRAM width of 32 bits. A split network has to be added to
separate the 8-bit streams using 3 SHIFT and 3 AND operations. Tree balancing may
re-distribute the min( ) operations to minimize the tree height.

/** _XppCfg_calcmin( )
 * Performs a minimum search over a character array
 * XPPIN:  iram6 contains the character input array
 * XPPOUT: iram0 contains the minimum value
 */
void _XppCfg_calcmin( )
{
  // IRAMs in FIFO mode
  //
  int *iram6; // vp->new_metrics, read access
  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, write access
  int i;
  unsigned char minmetric = 255;

  for (i=0; i<16; i++) {
    unsigned long val;

    val = *iram6++;
    minmetric = min(minmetric, min( min(val & 0xff, (val >> 8) & 0xff),
                                    min((val >> 16) & 0xff, val >> 24) ));
  }
  iram0[0] = (long)minmetric;
}

The following is a corresponding parameter table.

[0573]

Parameter                  Value
Vector length              16
Reused data set size       -
I/O IRAMs                  1 I + 1 O
ALU                        8
BREG                       5
FREG                       3
Data flow graph width      4
Data flow graph height     5
Configuration cycles       5 + 32

[0574] The recurrence of two cycles makes it profitable to double the loop body. Reduction
recognition again eliminates the loop-carried dependence on minmetric, enabling loop tiling and
then unroll-and-jam to increase parallelism. Constant propagation and tree rebalancing reduce
the dependence height of the final merging expression. The final configuration source code is
listed below:

/** _XppCfg_calcmin( )
 * Performs a minimum search over a character array
 * XPPIN:  iram6 contains the character input array
 * XPPOUT: iram0 contains the minimum value
 */
void _XppCfg_calcmin( )
{
  // IRAMs in FIFO mode
  //
  int *iram6; // vp->new_metrics, read access
  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, write access
  int i;
  unsigned char minmetric0 = 255, minmetric1 = 255;

  for (i=0; i<8; i++) {
    unsigned long val;

    val = *iram6++;
    minmetric0 = min(minmetric0, min( min(val & 0xff, (val >> 8) & 0xff),
                                      min((val >> 16) & 0xff, val >> 24) ));
    val = *iram6++;
    // second partial minimum accumulates into minmetric1, independent of minmetric0
    minmetric1 = min(minmetric1, min( min(val & 0xff, (val >> 8) & 0xff),
                                      min((val >> 16) & 0xff, val >> 24) ));
  }
  iram0[0] = (long)min(minmetric0, minmetric1);
}

[0575]

Parameter                  Value
Vector length              16
Reused data set size       -
I/O IRAMs                  1 I + 1 O
ALU                        16
BREG                       10
FREG                       0
Data flow graph width      2 * 4 = 8
Data flow graph height     5
Configuration cycles       5 + 16
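
The reduction with two independent partial minima can be checked on a host machine with the following C sketch. It is not the XPP configuration itself: the helper names are hypothetical, the input is assumed to be an even number of 32-bit words holding four metrics each, and the result is compared against a plain byte-wise scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static uint8_t min_u8(uint8_t a, uint8_t b) { return a < b ? a : b; }

/* Minimum of the four bytes packed in one word: a balanced tree of height 2. */
static uint8_t min_of_word(uint32_t val)
{
    uint8_t lo = min_u8((uint8_t)(val & 0xff), (uint8_t)((val >> 8) & 0xff));
    uint8_t hi = min_u8((uint8_t)((val >> 16) & 0xff), (uint8_t)(val >> 24));
    return min_u8(lo, hi);
}

/* Reference: sequential minimum search over bytes. */
uint8_t min_bytes(const uint8_t *a, size_t n)
{
    uint8_t m = 255;
    for (size_t i = 0; i < n; i++)
        m = min_u8(m, a[i]);
    return m;
}

/* Two partial minima over alternating words, merged after the loop.
 * Each accumulator depends only on its own previous value. */
uint8_t min_packed2(const uint32_t *w, size_t n_words)
{
    uint8_t m0 = 255, m1 = 255;
    assert(n_words % 2 == 0);
    for (size_t i = 0; i < n_words; i += 2) {
        m0 = min_u8(m0, min_of_word(w[i]));
        m1 = min_u8(m1, min_of_word(w[i + 1]));
    }
    return min_u8(m0, m1);
}
```

Because min is associative and commutative, the order in which bytes are combined does not matter, so min_packed2 over 16 words returns the same value as min_bytes over the corresponding 64 bytes regardless of byte order within a word.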
Re-Normalization

[0576] The fourth loop subtracts the minimum of the third loop from each element in the array.
The read-modify-write operation has to be broken up into two IRAMs. Otherwise, the IRAM
ports will limit throughput.

/** _XppCfg_subtract( )
 * Subtracts a scalar from a character array
 * XPPIN:  iram6 contains the character input array
 *         iram0 contains the scalar which is subtracted
 * XPPOUT: iram1 contains the result array
 */
void _XppCfg_subtract( )
{
  // IRAMs in FIFO mode
  //
  unsigned char *iram6; // vp->new_metrics, read access with 32-to-8-bit converter
  unsigned char *iram1; // vp->new_metrics, write access with 8-to-32-bit converter
  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, read access
  int i;
  unsigned char minmetric = iram0[0];

  for (i=0; i<64; i++) { // one character per iteration, 64 characters in total
    *iram1++ = *iram6++ - minmetric;
  }
}

The following is a corresponding parameter table.

[0577]

Parameter                  Value
Vector length              16 (= 64 chars)
Reused data set size       -
I/O IRAMs                  2 I + 1 O
ALU                        1 + 2 (converters)
BREG                       2 (converters)
FREG                       2 (converters)
Data flow graph width      1
Data flow graph height     1 + 8 (converters)
Configuration cycles       9 + 64

[0578] There are no loop-carried dependencies. Since the data size is 8 bits, the inner loop
can be unrolled four times without exceeding the IRAM bandwidth requirements. Networks
splitting the 32-bit stream into four 8-bit streams and rejoining the individual results to a
common 32-bit result stream are inserted. The final configuration source code is listed below:

/** _XppCfg_subtract( )
 * Subtracts a scalar from a character array
 * XPPIN:  iram6 contains the character input array
 *         iram0 contains the scalar which is subtracted
 * XPPOUT: iram1 contains the result array
 */
void _XppCfg_subtract( )
{
  // IRAMs in FIFO mode
  //
  int *iram6; // vp->new_metrics, read access
  int *iram1; // vp->new_metrics, write access
  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, read access
  int i;
  unsigned char minmetric = iram0[0];

  for (i=0; i<16; i++) {
    unsigned long val;
    unsigned char r0, r1, r2, r3;

    val = *iram6++;
    r0 = (val & 0xff) - minmetric;
    r1 = ((val >> 8) & 0xff) - minmetric;
    r2 = ((val >> 16) & 0xff) - minmetric;
    r3 = (val >> 24) - minmetric;
    *iram1++ = (r3 << 24) | (r2 << 16) | (r1 << 8) | r0;
  }
}

The following is a corresponding parameter table.

[0579]

Parameter                  Value
Vector length              16
Reused data set size       -
I/O IRAMs                  2 I + 1 O
ALU                        11
BREG                       6
FREG                       0
Data flow graph width      4
Data flow graph height     5
Configuration cycles       5 + 16 = 21
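
The split, subtract and rejoin steps of the unrolled subtraction can also be exercised on a host. The C sketch below is a hedged illustration with hypothetical function names; unlike the listing above, it widens each byte to 32 bits before shifting so that the rejoin is well defined in standard C on any platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Subtract m from each of the four bytes packed in val.
 * Each lane wraps modulo 256, like unsigned char arithmetic. */
uint32_t sub_packed(uint32_t val, uint8_t m)
{
    uint8_t r0 = (uint8_t)((val & 0xff) - m);
    uint8_t r1 = (uint8_t)(((val >> 8) & 0xff) - m);
    uint8_t r2 = (uint8_t)(((val >> 16) & 0xff) - m);
    uint8_t r3 = (uint8_t)((val >> 24) - m);
    /* widen before shifting to avoid overflowing a signed int */
    return ((uint32_t)r3 << 24) | ((uint32_t)r2 << 16)
         | ((uint32_t)r1 << 8) | (uint32_t)r0;
}

/* Apply sub_packed to a stream of words, one word per iteration. */
void subtract_words(const uint32_t *in, uint32_t *out, size_t n_words, uint8_t m)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n_words; i++)
        out[i] = sub_packed(in[i], m);
}
```

Sixteen calls of sub_packed cover the 64 metrics, which matches the loop count of 16 in the unrolled configuration.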
5.7.5 Final Code

[0580] The code executed on the RISC is listed below. It starts the configurations:


int update_viterbi29(void *p, unsigned char sym1, unsigned char sym2)
{
  struct v29 *vp = p;
  unsigned char *tmp;
  int normalize = 0;
  long _sym1 = sym1;
  long _sym2 = sym2;

  XppPreloadConfig(_XppCfg_viterbi29);
  XppPreload(0, Branchtab29_1, 32);
  XppPreload(2, Branchtab29_2, 32);
  XppPreload(4, vp->old_metrics, 32);
  XppPreload(5, vp->old_metrics + 128, 32);
  XppPreload(1, &_sym1, 1);
  XppPreload(3, &_sym2, 1);
  XppPreloadClean(6, vp->new_metrics, 64);
  XppPreloadClean(7, vp->dp->w, 8);
  XppExecute( );

  /* Renormalize metrics */
  if (vp->new_metrics[0] > 150) {
    long minmetric;

    XppPreloadConfig(_XppCfg_calcmin);
    XppPreloadClean(0, &minmetric, 1);
    XppExecute( );

    XppPreloadConfig(_XppCfg_subtract);
    XppPreloadClean(5, vp->new_metrics, 16);
    XppExecute( );
    XppSync(&minmetric, 1);
    normalize = minmetric;
  }

  XppSync(vp->new_metrics, 64);
  vp->dp++;
  tmp = vp->old_metrics;
  vp->old_metrics = vp->new_metrics;
  vp->new_metrics = tmp;

  return normalize;
}



[0581] The three configurations are shown in the following:

/** _XppCfg_viterbi29( )
 * Performs viterbi butterfly loop
 * XPPIN:  iram0,2 contains Branchtab29_1 and Branchtab29_2, respectively
 *         iram4,5 contains old_metrics and old_metrics+128, respectively
 *         iram1,3 contains scalars sym1 and sym2, respectively
 * XPPOUT: iram6 contains the new metrics array
 *         iram7 contains the decision array
 */
void _XppCfg_viterbi29( )
{
  // IRAMs in FIFO mode
  //
  char *iram0;  // Branchtab29_1, read access with 32-to-8-bit converter
  char *iram2;  // Branchtab29_2, read access with 32-to-8-bit converter
  char *iram4;  // vp->old_metrics, read access with 32-to-8-bit converter
  char *iram5;  // vp->old_metrics+128, read access with 32-to-8-bit converter
  short *iram6; // vp->new_metrics, write access with 16-to-32-bit converter
  unsigned long *iram7; // vp->dp->w, write access
  // IRAMs in RAM mode
  //
  int iram1[128]; // sym1, read access
  int iram3[128]; // sym2, read access
  int i, i2;
  int rlse;
  unsigned char sym1, sym2;

  sym1 = iram1[0];
  sym2 = iram3[0];
  for (i=0; i<8; i++) {
    rlse = 0;
    for (i2=0; i2<32; i2+=2) { // unrolled once
      unsigned char metric, _tmp, m0, m1, _m0, _m1;

      metric = ((*iram0++ ^ sym1) +
                (*iram2++ ^ sym2) + 1)/2;
      _tmp = (metric << 1) - 15;
      m0 = *iram4++ + metric;
      m1 = *iram5++ + (15 - metric);
      _m0 = m0 - _tmp;
      _m1 = m1 + _tmp;
      *iram6++ = (min(m0,m1) << 8) | min(_m0,_m1);
      rlse = rlse | ( m0 >=  m1) << i2
                  | (_m0 >= _m1) << (i2+1);
    }
    *iram7++ = rlse;
  }
}

/** _XppCfg_calcmin( )
 * Performs a minimum search over a character array
 * XPPIN:  iram6 contains the character input array
 * XPPOUT: iram0 contains the minimum value
 */
void _XppCfg_calcmin( )
{
  // IRAMs in FIFO mode
  //
  int *iram6; // vp->new_metrics, read access
  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, write access
  int i;
  unsigned char minmetric0 = 255, minmetric1 = 255;

  for (i=0; i<8; i++) { // two words per iteration, 16 words in total
    unsigned long val;

    val = *iram6++;
    minmetric0 = min(minmetric0, min( min(val & 0xff, (val >> 8) & 0xff),
                                      min((val >> 16) & 0xff, val >> 24) ));
    val = *iram6++;
    minmetric1 = min(minmetric1, min( min(val & 0xff, (val >> 8) & 0xff),
                                      min((val >> 16) & 0xff, val >> 24) ));
  }
  iram0[0] = (long)min(minmetric0, minmetric1);
}

/** _XppCfg_subtract( )
 * Subtracts a scalar from a character array
 * XPPIN:  iram6 contains the character input array
 *         iram0 contains the scalar which is subtracted
 * XPPOUT: iram1 contains the result array
 */
void _XppCfg_subtract( )
{
  // IRAMs in FIFO mode
  //
  int *iram6; // vp->new_metrics, read access
  int *iram1; // vp->new_metrics, write access
  // IRAMs in RAM mode
  //
  int iram0[128]; // minmetric, read access
  int i;
  unsigned char minmetric = iram0[0];

  for (i=0; i<16; i++) {
    unsigned long val;
    unsigned char r0, r1, r2, r3;

    val = *iram6++;
    r0 = (val & 0xff) - minmetric;
    r1 = ((val >> 8) & 0xff) - minmetric;
    r2 = ((val >> 16) & 0xff) - minmetric;
    r3 = (val >> 24) - minmetric;
    *iram1++ = (r3 << 24) | (r2 << 16) | (r1 << 8) | r0;
  }
}



5.7.6 Performance Evaluation 

[0582] The data transfer performance is listed for each data object in the following table. It is
assumed that there is no data in the cache before executing the update_viterbi29 function. In
addition, it is assumed that the if condition in the source code is true, i.e. new_metrics[0]>150.

Data                    Size   Type size  Data size  Cache    RAM - Cache     Cache - IRAM
                               [bytes]    [bytes]    Misses   [cache cycles]  [cache cycles]
Preloads
Branchtab29_1           128    1          128        4        224             8
Branchtab29_2           128    1          128        4        224             8
vp->old_metrics         128    1          128        4        224             8
vp->old_metrics + 128   128    1          128        4        224             8
vp->new_metrics         256    1          256        8        448             16
sym1                    1      4          4          1        56              1
sym2                    1      4          4          1        56              1
minmetric               1      4          4          1        56              1
Writebacks
vp->dp->w               8      4          32         1        88              2
vp->new_metrics         256    1          256                 256             16
minmetric               1      4          4          1        88              1



[0583] The write-back of the elements of new_metrics causes no cache miss, because the cache
line was already loaded by the preload operation of old_metrics. Therefore the write-back does
not include cycles for write allocation.

[0584] The base for the comparison are the hand-written NML source codes vit.nml, min.nml
and sub.nml, which implement the configurations _XppCfg_viterbi29, _XppCfg_calcmin and
_XppCfg_subtract, respectively. For the _XppCfg_viterbi29 configuration two versions are
evaluated: with unrolling (vit.nml) and without unrolling (vit_nounroll.nml).

30 

[0585] The performance evaluation was done for each configuration separately, and for all
configurations of the update_viterbi29 function. It is assumed that the separate configurations
are the only configurations in the test case [3]. Therefore the separate configurations need
different preloads and write-backs. The following table lists the required data transfers based on
the table above. Column Data RAM gives the number of cycles needed for the data transfer
between RAM and cache. Column DCache gives the number of cycles needed for the data
transfer between cache and IRAM.

[3] For testing the separate configurations no RISC source code is given. It must contain the
XppPreload and XppPreloadClean functions for the required preloads and write-backs.

configurations       preloads                write-backs        Data RAM   DCache
viterbi29            Branchtab29_1           vp->new_metrics    1352       52
                     Branchtab29_2           vp->dp->w
                     vp->old_metrics
                     vp->old_metrics+128
                     sym1
                     sym2
calcmin              vp->new_metrics         minmetric          536        17
subtract             vp->new_metrics         vp->new_metrics    760        33
                     minmetric
all configurations   Branchtab29_1           vp->dp->w          1440       53
                     Branchtab29_2           minmetric
                     vp->old_metrics         vp->new_metrics
                     vp->old_metrics+128
                     sym1
                     sym2
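
The Data RAM and DCache entries per configuration can be reproduced by adding up the per-object costs from the data transfer table. The following C sketch does this for the four rows; the cycle values are copied from that table, while the struct, array and function names are hypothetical and used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Transfer cost of one data object, in cache cycles. */
struct xfer { int ram; int dcache; };

/* Preloads and write-backs of each configuration, in table order. */
static const struct xfer VITERBI29[] = {
    {224, 8}, {224, 8},   /* preload Branchtab29_1, Branchtab29_2 */
    {224, 8}, {224, 8},   /* preload old_metrics, old_metrics+128 */
    {56, 1},  {56, 1},    /* preload sym1, sym2 */
    {256, 16}, {88, 2},   /* write back new_metrics, dp->w */
};
static const struct xfer CALCMIN[] = {
    {448, 16},            /* preload new_metrics */
    {88, 1},              /* write back minmetric */
};
static const struct xfer SUBTRACT[] = {
    {448, 16}, {56, 1},   /* preload new_metrics, minmetric */
    {256, 16},            /* write back new_metrics */
};
static const struct xfer ALL_CFGS[] = {
    {224, 8}, {224, 8}, {224, 8}, {224, 8}, {56, 1}, {56, 1},  /* preloads */
    {88, 2}, {88, 1}, {256, 16},  /* write back dp->w, minmetric, new_metrics */
};

/* Sum of all transfers of one configuration. */
struct xfer xfer_sum(const struct xfer *items, size_t n)
{
    struct xfer total = {0, 0};
    assert(items != NULL);
    for (size_t i = 0; i < n; i++) {
        total.ram += items[i].ram;
        total.dcache += items[i].dcache;
    }
    return total;
}
```

When all configurations run back to back, new_metrics and minmetric stay in the IRAMs between configurations, so their preloads drop out and only the write-backs remain in the last row.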

[0586] TABLE 1: Performance on IDCT (8x8)

                        LEON alone      LEON with XPP    LEON with XPP    LEON with XPP
                                        in IRQ Mode      in Poll Mode     in Hold Mode
Configuration of XPP    -               71,308 ns        84,364 ns        77,976 ns
                                        17,827 cycles    21,091 cycles    19,494 cycles
2D IDCT (8x8)           14,672 ns       3,272 ns         3,872 ns         3,568 ns
                        3,668 cycles    818 cycles       968 cycles       892 cycles

30 [0587] 0 

[0588] In the following tables the performance is compared to the reference system.

[0589] The first table is the worst case, representing the current example. Since no outer loop is
given, the configurations cannot be assumed to be in cache. Moreover, an XppSync instruction
has to be inserted at the end of the function to force write-backs to the cache, ensuring data
consistency for the caller. This setup prevents pipelining of the Ld/Ex/WB phases of the
computation; therefore the number of cycles of the RAM and cache accesses for the XPP has to
be added to the computation cycles instead of taking the maximum (columns XPP Execute -
Cache and XPP Execute - RAM).

                           Data Access    Configuration   XPP Execute           Ref. System   Speedup
configurations             RAM    DCache  RAM     ICache  Core  Cache  RAM      Cache  RAM    Core  Cache  RAM
viterbi29 (unrolling)      1352   52      9688    1377    366   1795   12783    3624   4976   9.9   2.0    0.4
viterbi29 (no unrolling)   1352   52      5432    770     588   1410   8142     3624   4976   6.2   2.6    0.6
calcmin                    536    17      3024    429     56    502    4045     256    792    4.6   0.5    0.2
subtract                   760    33      1736    245     76    354    2817     192    952    2.5   0.5    0.3
all cfgs (unrolling)       1440   53      14392   2051    498   2602   18381    4072   5512   8.2   1.6    0.3
all cfgs (no unrolling)    1440   53      10136   1444    720   2217   13740    4072   5512   5.7   1.8    0.4

[0590] Usually the update_viterbi29 function is called in a loop. Therefore, in the following
table, it is assumed that all three configurations are cached in the XPP array for all but the first
iteration. Additionally, the XppSync instruction can be placed after the outer loop, enabling
pipelining of the memory transfers and the execution.

                           Data Access    Configuration   XPP Execute           Ref. System   Speedup
configurations             RAM    DCache  RAM     ICache  Core  Cache  RAM      Cache  RAM    Core  Cache  RAM
viterbi29 (unrolling)      1352   52                      366   366    1352     3624   4976   9.9   9.9    3.7
viterbi29 (no unrolling)   1352   52                      588   588    1352     3624   4976   6.2   6.2    3.7
calcmin                    536    17                      56    56     536      256    792    4.6   4.6    1.5
subtract                   760    33                      76    76     760      192    952    2.5   2.5    1.3
all cfgs (unrolling)       1440   53                      498   498    1440     4072   5512   8.2   8.2    3.8
all cfgs (no unrolling)    1440   53                      720   720    1440     4072   5512   5.7   5.7    3.8
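
The difference between the two tables (adding the phases in the worst case, taking the maximum once loading, execution and write-back are pipelined) can be written down as a small C sketch. The formulas are a model inferred from the tabulated numbers, in particular the way ICache and core cycles enter the worst-case RAM column; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Per-configuration cycle counts, taken from the first five table columns. */
struct cfg_cost { int data_ram, dcache, cfg_ram, icache, core; };

static int max2(int a, int b) { return a > b ? a : b; }

/* Worst case: no overlap, so all phases are added. */
int cache_cycles_serial(struct cfg_cost c)
{
    assert(c.core > 0);
    return c.dcache + c.icache + c.core;
}

int ram_cycles_serial(struct cfg_cost c)
{
    assert(c.core > 0);
    return c.data_ram + c.cfg_ram + c.icache + c.core;
}

/* Configurations cached and transfers pipelined: the slowest phase dominates. */
int cache_cycles_pipelined(struct cfg_cost c)
{
    return max2(c.dcache, c.core);
}

int ram_cycles_pipelined(struct cfg_cost c)
{
    return max2(c.data_ram, c.core);
}
```

For the viterbi29 configuration with unrolling this yields 1795 and 12783 cycles in the worst case and 366 and 1352 cycles in the pipelined case, which is where the improved cache and RAM speedups in the second table come from.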

[0591] For viterbi a significant performance improvement up to a factor of 8.2 can be achieved 
using the XPP compared to the reference system. 

5 

[0592] The final utilization is shown in the following tables. The information is taken from the
.info files generated from the NML source code by the XMAP tool.

Utilization of the viterbi29 configuration with unrolling (vit.nml):

Parameter                     Value
Vector length                 read: 32, write: 64
Reused data set size          -
I/O IRAMs [sum-pct]           8 - 50%
ALU [sum-pct]                 47 - 73%
BREG [def/route/sum-pct]      27/37/64 - 80%
FREG [def/route/sum-pct]      24/27/51 - 64%

Utilization of the viterbi29 configuration without unrolling (vit_nounroll.nml):

Parameter                     Value
Vector length                 read: 32, write: 64
Reused data set size          -
I/O IRAMs [sum-pct]           8 - 50%
ALU [sum-pct]                 25 - 39%
BREG [def/route/sum-pct]      18/23/41 - 51%
FREG [def/route/sum-pct]      18/11/29 - 36%

Utilization of the calcmin configuration (min.nml):

Parameter                     Value
Vector length                 16
Reused data set size          -
I/O IRAMs [sum-pct]           2 - 13%
ALU [sum-pct]                 19 - 30%
BREG [def/route/sum-pct]      14/16/30 - 38%
FREG [def/route/sum-pct]      7/6/13 - 16%

Utilization of the subtract configuration (sub.nml):

Parameter                     Value
Vector length                 16
Reused data set size          -
I/O IRAMs [sum-pct]           3 - 19%
ALU [sum-pct]                 11 - 17%
BREG [def/route/sum-pct]      6/10/16 - 20%
FREG [def/route/sum-pct]      2/9/11 - 14%

5.8 MPEG2 Codec - Quantization
[0593] The quantization file may include routines for quantization and inverse quantization of
8x8 macro blocks. These functions may differ for intra and non-intra blocks. Furthermore, the
encoder may distinguish between MPEG1 and MPEG2 inverse quantization.

[0594] Since all functions may have the same layout (i.e., some checks, one main loop running
over the macro block quantizing with a quantization matrix), focus is placed on "iquant_intra,"
the inverse quantization of intra-blocks, since it may include all elements found in the other
procedures. (The non_intra quantization loop bodies are more complicated, but add no compiler
complexity.) In the source code the mpeg1 part is already inlined, which is straightforward
since the function is statically defined and includes no function calls itself. Therefore, the
compiler may inline it and dead function elimination may remove the whole definition.



5.8.1 Original Code
[0595] TABLE US 00174

void iquant_intra(src,dst,dc_prec,quant_mat,mquant)
short *src, *dst;
int dc_prec;
unsigned char *quant_mat;
int mquant;
{
  int i, val, sum;
  if (mpeg1) {
    dst[0] = src[0] << (3-dc_prec);
    for (i=1; i<64; i++)
    {
      val = (int)(src[i]*quant_mat[i]*mquant)/16;
      /* mismatch control */
      if ((val&1)==0 && val!=0)
        val += (val>0) ? -1 : 1;
      /* saturation */
      dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
    }
  }
  else
  {
    sum = dst[0] = src[0] << (3-dc_prec);
    for (i=1; i<64; i++)
    {
      val = (int)(src[i]*quant_mat[i]*mquant)/16;
      sum += dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
    }
    /* mismatch control */
    if ((sum&1)==0)
      dst[63] ^= 1;
  }
}



[0596] In the following subsections we concentrate on the MPEG2 part.

5.8.2 Preliminary Transformations

Interprocedural Optimizations
[0597] Analyzing the loop bodies, it can be seen that they may easily fit into the XPP array and
do not use the maximum of resources by far. The function is called three times from module
putseq.c. With inter-module function inlining, the code for the function call may disappear and
may be replaced with the function. Therefore, it may be as follows: TABLE US 00175

for (k=0; k<mb_height*mb_width; k++) {
  if (mbinfo[k].mb_type & MB_INTRA)
    for (j=0; j<block_count; j++)
      if (mpeg1) {
        /* omitted */
      } else {
        sum = dst[0] = src[0] << (3-dc_prec);
        for (i=1; i<64; i++)
        {
          val = (int)(src[i]*quant_mat[i]*mquant)/16;
          sum += dst[i] = (val>2047) ? 2047 : ((val<-2048) ? -2048 : val);
        }
        /* mismatch control */
        if ((sum&1)==0)
          dst[63] ^= 1;
      }
  else
    /* non intra block part omitted */
}

Basic Transformations
[0598] The following transformations are done:

[0599] * A peephole optimization reduces the division by 16 to a right shift by 4. This is
essential since we do not consider loop bodies containing division for the XPP.

[0600] * Idiom recognition reduces the statement after the comment /* saturation */ to
dst[i]=min(max(val, -2048), 2047).

[0601] * Since the global variable mpeg1 does not change within the loop, loop unswitching moves
the control statement outside the j-loop and produces two loop nests.

[0602] * Partial redundancy elimination inserts temporaries which store intermediate results.

[0603] * Reads from arrays are stored in temporaries and moved as early as possible.

[0604] * Writes to arrays are moved as late as possible.
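
The first two transformations in the list can be checked in isolation. The following is a minimal sketch in C; all function names are illustrative assumptions. The shift check is deliberately restricted to non-negative operands, where division by 16 and a right shift by 4 are guaranteed to agree in C.

```c
/* Saturation as written in the original source (nested conditionals). */
static int saturate_ternary(int val)
{
    return (val > 2047) ? 2047 : ((val < -2048) ? -2048 : val);
}

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* The same operation after idiom recognition: a min/max pair. */
static int saturate_minmax(int val)
{
    return min_int(max_int(val, -2048), 2047);
}

/* Returns 1 if val/16 equals val>>4; intended for non-negative val only. */
static int div16_equals_shift4(int val)
{
    return (val / 16) == (val >> 4);
}
```

The min/max form maps directly onto two ALU operations, which is why the idiom is worth recognizing before the loop body is synthesized.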

[0605] Below is the code after these transformations. The MPEG1 part again is omitted, but looks
similar. TABLE US 00176

for (k=0; k<mb_height*mb_width; k++) {
  if (mbinfo[k].mb_type & MB_INTRA)
    if (mpeg1)
      /* omitted */
    else
      for (j=0; j<block_count; j++) {
        block_data = blocks[k*block_count+j][0];
        tmp1 = block_data << (3-dc_prec);
        sum = tmp1;
        blocks[k*block_count+j][0] = tmp1;
        for (i=1; i<64; i++) {
          block_data = blocks[k*block_count+j][i];
          mat_data = intra_q[i];
          val = (int)(block_data * mat_data * mquant) >> 4;
          tmp2 = min(2047, max(-2048, val));
          sum += tmp2;
          blocks[k*block_count+j][i] = tmp2;
        }
        /* mismatch control */
        block_data = blocks[k*block_count+j][63];
        if ((sum&1)==0) {
          block_data ^= 1;
        }
        blocks[k*block_count+j][63] = block_data;
      }
}

[0606] The i-loop is a candidate to run on the XPP array; therefore we try to increase the size
of the loop body as much as possible. Before we increase parallelism, the next subsection shows
an optimization which transforms the loop nest into a perfect loop nest.

Inverse Loop-Invariant Code Motion

[0607] The loop-invariant statements surrounding the loop body are candidates for inverse
loop-invariant code motion. By moving them into the loop body and guarding them properly, the
loop nest becomes perfect and the utilization of the innermost loop increases. Since this
optimization is reversible, it can be undone whenever needed.

[0608] This time we only show the two innermost loop nests. TABLE US 00177

for (j=0; j<block_count; j++) {
  for (i=0; i<64; i++) {
    block_data = blocks[k*block_count+j][i];
    mat_data = intra_q[i];
    sol_0 = block_data << (3-dc_prec);
    sol_1_63 = (int)(block_data * mat_data * mquant) >> 4;
    sat_1_63 = min(2047, max(-2048, sol_1_63));
    guard1 = (i==0);
    guard2 = (i==63);
    if (guard1)
      sol = sol_0;
    else
      sol = sat_1_63;
    if (guard1)
      sum = sol;
    else
      sum += sol;
    guard3 = ((sum & 1) == 0);
    if (guard2 && guard3)
      sol ^= 1;
    blocks[k*block_count+j][i] = sol;
  }
}
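
That the guarded loop body computes the same block as the peeled form can be checked on the host. The following is a minimal sketch in C under stated assumptions: the function names are hypothetical, the block is updated in place, and the DC shift is written as a multiplication by a power of two so that the speculative computation stays well-defined for negative coefficients.

```c
#define BLOCK_LEN 64

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Form before inverse loop-invariant code motion: first element peeled,
 * mismatch control after the loop. */
static void iquant_peeled(short *blk, const unsigned char *intra_q,
                          int dc_prec, int mquant)
{
    int i, val, tmp, sum;
    tmp = blk[0] * (1 << (3 - dc_prec));
    sum = tmp;
    blk[0] = (short)tmp;
    for (i = 1; i < BLOCK_LEN; i++) {
        val = (blk[i] * intra_q[i] * mquant) >> 4;
        tmp = min_int(2047, max_int(-2048, val));
        sum += tmp;
        blk[i] = (short)tmp;
    }
    if ((sum & 1) == 0)
        blk[63] ^= 1;
}

/* Perfect-loop form: the peeled statements are moved into the body
 * and selected by guards, as in the listing above. */
static void iquant_guarded(short *blk, const unsigned char *intra_q,
                           int dc_prec, int mquant)
{
    int i, sum = 0;
    for (i = 0; i < BLOCK_LEN; i++) {
        int data = blk[i];
        int sol_0 = data * (1 << (3 - dc_prec));
        int sol_1_63 = (data * intra_q[i] * mquant) >> 4;
        int sat_1_63 = min_int(2047, max_int(-2048, sol_1_63));
        int guard1 = (i == 0);
        int guard2 = (i == 63);
        int sol = guard1 ? sol_0 : sat_1_63;
        if (guard1)
            sum = sol;
        else
            sum += sol;
        if (guard2 && ((sum & 1) == 0))
            sol ^= 1;
        blk[i] = (short)sol;
    }
}
```

Both alternatives of each guard are computed in every iteration and only the selection depends on the guard, which is the shape a dataflow pipeline needs.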



[0609] The following table shows the estimated utilization and performance by a configuration
synthesized from the inner loop. The values show that there are many resources left for further
optimizations. TABLE US 00178

Parameter                  Value
Vector length              32 (64 16-bit values)
Reused data set size       -
I/O IRAMs                  4
ALU                        9
BREG                       9
FREG                       few
Data flow graph width      4
Data flow graph height     7 + 2 (converters)
Configuration cycles       9 + 64

5.8.3 Enhancing Parallelism

[0610] To increase parallelism we have two possibilities, which can be combined:

[0611] * Since the smallest data type used in the inner loop limits the throughput of the
synthesized pipeline, we must try to improve this throughput. This is shown in the next
subsection.

[0612] * The j-loop nest is a candidate for unroll-and-jam when interprocedural value range
analysis finds out that block_count can only have the values 6, 8 or 12.

Loop Distribution, Partial Unrolling, Reduction Recognition, Loop Fusion

[0613] The conversion of the 8-bit values due to the unsigned character array containing the
quantization matrix limits the throughput of the pipeline. In the best case a value can be read
from or written to the IRAM only every fourth cycle. Therefore we must try to increase the
throughput by splitting the 32-bit value into 8-bit values and processing them concurrently in
different pipelines. Unfortunately the loop-carried true dependence due to the accesses to sum
prevents a simple partial unrolling which would achieve this. Loop distribution overcomes this
problem.

Loop Distribution

[0614] Since there is no dependence from a read of sum to a write of block_data in the code, it
is possible to distribute the innermost loop into two loops. The first loop also absorbs the
guarded loop-invariant code which represents the first iteration. TABLE US 00179

for (j=0; j<block_count; j++) {
  for (i=0; i<64; i++) {
    block_data = blocks[k*block_count+j][i];
    mat_data = intra_q[i];
    sol_0 = block_data << (3-dc_prec);
    sol_1_63 = (int)(block_data * mat_data * mquant) >> 4;
    sat_1_63 = min(2047, max(-2048, sol_1_63));
    guard1 = (i==0);
    if (guard1)
      sol = sol_0;
    else
      sol = sat_1_63;
    blocks[k*block_count+j][i] = sol;
  }
  for (i=0; i<64; i++) {
    block_data = blocks[k*block_count+j][i];
    guard1 = (i==0);
    if (guard1)
      sum = block_data;
    else
      sum += block_data;
  }
  /* mismatch control */
  block_data = blocks[k*block_count+j][63];
  if ((sum&1)==0) {
    block_data ^= 1;
  }
  blocks[k*block_count+j][63] = block_data;
}

[0615] Now the first generated loop can be partially unrolled, while the second one is a
classical example of sum reduction.

Loop 1 - Partial Unrolling

[0616] The first loop utilizes about 10 ALUs (including the 32-to-8 bit conversion). Therefore
the unrolling factor would be limited to 6. The next smaller divisor of the loop count is 4. If
this factor were taken, another restriction would apply: four block data values would be read
and written in one iteration. Although this could be synthesized by means of shift register
synthesis or data duplication for the reads, the writes would either cause an undefined result
at write-back, if written to two distinct IRAMs, or the merge of the values would halve the
throughput. Therefore the unrolling factor chosen is 2, reaching the maximum throughput with
minimum utilization.

[0617] Dead code elimination removes the guarded statement for the parts representing the odd
iteration values. TABLE US 00180
for (i=0; i<64; i+=2) { // unrolled once
  // iteration i==0,2,4,...
  block_data_0 = blocks[k*block_count+j][i];
  mat_data_0 = intra_q[i];
  sol_0_0 = block_data_0 << (3-dc_prec);
  sol_1_63_0 = (int)(block_data_0 * mat_data_0 * mquant) >> 4;
  sat_1_63_0 = min(2047, max(-2048, sol_1_63_0));
  guard1_0 = (i==0);
  if (guard1_0)
    sol_0 = sol_0_0;
  else
    sol_0 = sat_1_63_0;
  blocks[k*block_count+j][i] = sol_0;
  // iteration i==1,3,5,...
  block_data_1 = blocks[k*block_count+j][i+1];
  mat_data_1 = intra_q[i+1];
  sol_0_1 = block_data_1 << (3-dc_prec);
  sol_1_63_1 = (int)(block_data_1 * mat_data_1 * mquant) >> 4;
  sat_1_63_1 = min(2047, max(-2048, sol_1_63_1));
  blocks[k*block_count+j][i+1] = sat_1_63_1;
}

Loop 2 - Sum Reduction

[0618] As described above, the block data write limits the reduction possibilities. Therefore
the code transforms to: TABLE US 00181

for (i=0; i<64; i+=2) {
  block_data_0 = blocks[k*block_count+j][i];
  block_data_1 = blocks[k*block_count+j][i+1];
  guard1 = (i==0);
  if (guard1)
    sum = block_data_0 + block_data_1;
  else
    sum += block_data_0 + block_data_1;
}
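
Because integer addition is associative, the unrolled reduction yields exactly the same sum as the sequential loop. A minimal sketch in C, with hypothetical function names, that makes the equivalence testable:

```c
/* Sequential sum over the 64 coefficients of one block. */
static int sum_sequential(const short *blk)
{
    int i, sum = 0;
    for (i = 0; i < 64; i++)
        sum += blk[i];
    return sum;
}

/* The same reduction unrolled once: two elements per iteration, with the
 * guarded initialization in the first iteration as in the listing above. */
static int sum_unrolled(const short *blk)
{
    int i, sum = 0;
    for (i = 0; i < 64; i += 2) {
        int guard1 = (i == 0);
        if (guard1)
            sum = blk[i] + blk[i + 1];
        else
            sum += blk[i] + blk[i + 1];
    }
    return sum;
}
```

Only the parity of the sum is needed for mismatch control, so the unrolled form loses nothing while halving the iteration count.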

Loop Fusion

[0619] The new loops can then be merged again, because still no dependence exists between them.
Furthermore the loop-invariant code following the loops is moved inside the loop body,
producing a perfect loop nest. TABLE US 00182

for (j=0; j<block_count; j++) {
  for (i=0; i<64; i+=2) { // unrolled once
    block_data_0 = blocks[k*block_count+j][i];
    block_data_1 = blocks[k*block_count+j][i+1];
    mat_data_0 = intra_q[i];
    mat_data_1 = intra_q[i+1];
    // i == 0,2,4,...
    sol_0_0 = block_data_0 << (3-dc_prec);
    sol_1_63_0 = (int)(block_data_0 * mat_data_0 * mquant) >> 4;
    sat_1_63_0 = min(2047, max(-2048, sol_1_63_0));
    guard0 = (i==0);
    if (guard0)
      sol_0 = sol_0_0;
    else
      sol_0 = sat_1_63_0;
    sol_0_1 = block_data_1 << (3-dc_prec);
    // i == 1,3,5,...
    sol_1_63_1 = (int)(block_data_1 * mat_data_1 * mquant) >> 4;
    sol_1 = min(2047, max(-2048, sol_1_63_1));
    // sum reduction from loop 2
    if (guard0)
      sum = sol_0 + sol_1;
    else
      sum += sol_0 + sol_1;
    guard2 = (i == 62);
    guard3 = ((sum & 1) == 0);
    if (guard2 && guard3)
      sol_1 ^= 1;
    blocks[k*block_count+j][i] = sol_0;
    blocks[k*block_count+j][i+1] = sol_1;
  }
}

[0620] As can be seen in the next table, these transformations have almost doubled the
utilization and performance. TABLE US 00183

Parameter                  Value
Vector length              32 (64 16-bit values)
Reused data set size       -
I/O IRAMs                  4
ALU                        18
BREG                       11
FREG                       4
Data flow graph width      8
Data flow graph height     9 + 4 (converters)
Configuration cycles       13 + 32

Unroll-and-Jam

[0621] As said above, the j-loop nest is a candidate for unroll-and-jam when interprocedural
value range analysis finds out that block_count can only have the values 6, 8 or 12. Therefore
it has a value range [6,12] with the additional property of being divisible by 2. Thus
unroll-and-jam with an unrolling factor equal to 2 is applicable. It should be noted that the
resource constraints would allow a bigger value. Since no loop-carried dependence at the level
of the j-loop exists, this transformation is safe. Please note that redundant load/store
elimination removes the loop-invariant duplicated loads from the array intra_q and the scalars
dc_prec and mquant. TABLE US 00184

for (j=0; j<block_count; j+=2) { // unrolled and jammed once
  for (i=0; i<64; i+=2) { // unrolled once
    // common code
    mat_data_0 = intra_q[i];
    mat_data_1 = intra_q[i+1];
    guard1 = (i==0);
    guard2 = (i == 62);
    // j == 0,2,...
    block0_data_0 = blocks[k*block_count+j][i];
    block0_data_1 = blocks[k*block_count+j][i+1];
    // i == 0,2,4,...
    sol0_0_0 = block0_data_0 << (3-dc_prec);
    sol0_1_63_0 = (int)(block0_data_0 * mat_data_0 * mquant) >> 4;
    sat0_1_63_0 = min(2047, max(-2048, sol0_1_63_0));
    if (guard1)
      sol0_0 = sol0_0_0;
    else
      sol0_0 = sat0_1_63_0;
    // i == 1,3,5,...
    sol0_1_63_1 = (int)(block0_data_1 * mat_data_1 * mquant) >> 4;
    sol0_1 = min(2047, max(-2048, sol0_1_63_1));
    if (guard1)
      sum0 = sol0_0 + sol0_1;
    else
      sum0 += sol0_0 + sol0_1;
    guard3 = ((sum0 & 1) == 0);
    if (guard2 && guard3)
      sol0_1 ^= 1;
    blocks[k*block_count+j][i] = sol0_0;
    blocks[k*block_count+j][i+1] = sol0_1;
    // j == 1,3,...
    block1_data_0 = blocks[k*block_count+j+1][i];
    block1_data_1 = blocks[k*block_count+j+1][i+1];
    // i == 0,2,4,...
    sol1_0_0 = block1_data_0 << (3-dc_prec);
    sol1_1_63_0 = (int)(block1_data_0 * mat_data_0 * mquant) >> 4;
    sat1_1_63_0 = min(2047, max(-2048, sol1_1_63_0));
    if (guard1)
      sol1_0 = sol1_0_0;
    else
      sol1_0 = sat1_1_63_0;
    // i == 1,3,5,...
    sol1_1_63_1 = (int)(block1_data_1 * mat_data_1 * mquant) >> 4;
    sol1_1 = min(2047, max(-2048, sol1_1_63_1));
    if (guard1)
      sum1 = sol1_0 + sol1_1;
    else
      sum1 += sol1_0 + sol1_1;
    guard4 = ((sum1 & 1) == 0);
    if (guard2 && guard4)
      sol1_1 ^= 1;
    blocks[k*block_count+j+1][i] = sol1_0;
    blocks[k*block_count+j+1][i+1] = sol1_1;
  }
}
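
Stripped of the quantization arithmetic, unroll-and-jam has the following shape. This is a minimal sketch in C, assuming a stand-in loop body `body` with no dependence between different blocks; all names are hypothetical.

```c
#define BLOCK_LEN 64

/* Stand-in for the per-element work; any body without dependences
 * between different blocks behaves the same way. */
static short body(short x, int i)
{
    return (short)(x + i);
}

/* Original nest: one block per outer iteration. */
static void plain_nest(short blocks[][BLOCK_LEN], int block_count)
{
    for (int j = 0; j < block_count; j++)
        for (int i = 0; i < BLOCK_LEN; i++)
            blocks[j][i] = body(blocks[j][i], i);
}

/* Outer loop unrolled by 2 and the two inner loops fused ("jammed").
 * Requires an even block_count, which holds for 6, 8 and 12. */
static void unroll_and_jam(short blocks[][BLOCK_LEN], int block_count)
{
    for (int j = 0; j < block_count; j += 2)
        for (int i = 0; i < BLOCK_LEN; i++) {
            blocks[j][i]     = body(blocks[j][i], i);
            blocks[j + 1][i] = body(blocks[j + 1][i], i);
        }
}
```

The jammed inner body contains two independent copies of the work, which is what allows two pipelines to run side by side on the array.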

[0622] The results of the version where unroll-and-jam is applied are shown in the following
table. TABLE US 00185

Parameter                  Value
Vector length              2 * 32 (2 * 64 16-bit values)
Reused data set size       -
I/O IRAMs                  5
ALU                        36
BREG                       22
FREG                       8
Data flow graph width      2 * 8
Data flow graph height     9 + 4 (converters)
Configuration cycles       13 + 32

5.8.4 Final Code

[0623] The RISC code contains only the outer loops' control code and the preload and execute
calls. Since the data besides the block data does not vary within the j-loop, and the XPP FIFO
initially sets the IRAM values to the previous preload, redundant load/store elimination moves
the preloads in front of the j-loop. The same is done with the configuration preload. The RISC
code then looks like: TABLE US 00186

for (k=0; k<mb_height*mb_width; k++) {
  if (mbinfo[k].mb_type & MB_INTRA)
    if (mpeg1)
      /* omitted */
    else {
      XppPreloadConfig(_XppCfg_iquant_intra_mpeg2);
      XppPreload(2, &intra_q, 16);
      XppPreload(3, &mbinfo[k].mquant, 1);
      XppPreload(4, &dc_prec, 1);
      for (j=0; j<block_count; j+=2) {
        XppPreload(0, &blocks[k*block_count+j], 32);
        XppPreload(1, &blocks[k*block_count+j+1], 32);
        XppExecute();
      }
      XppSync(&blocks[k*block_count], 64 * block_count);
    }
}

[0624] The configuration code reads: TABLE US 00187

void _XppCfg_iquant_intra_mpeg2()
{
  // IRAMs
  // blocks[k*block_count+j] and blocks[k*block_count+j+1], respectively
  // Read access with splitter to two 16-bit packets.
  // iram0,1[i] and iram0,1[i+1] are available concurrently.
  short iram0[256], iram1[256];
  // intra_q
  // Read access with splitter to 4 8-bit streams, remerged to 2 streams.
  // iram2[i] and iram2[i+1] are available concurrently.
  unsigned char iram2[512];
  int iram3[128], iram4[128]; // scalars mquant and dc_prec
  // temporaries
  int i;
  int sol0_0_0, sol0_0_1, sol0_0, sol0_1;
  int sol1_0_0, sol1_0_1, sol1_0, sol1_1;
  int sol0_1_63_0, sol0_1_63_1, sat0_1_63_0;
  int sol1_1_63_0, sol1_1_63_1, sat1_1_63_0;
  int sum0, sum1;
  event guard1, guard2, guard3, guard4;
  for (i=0; i<64; i+=2) { // unrolled once
    // common code
    guard1 = (i==0);
    guard2 = (i == 62);
    // j == 0,2,...
    // i == 0,2,4,...
    sol0_0_0 = iram0[i] << (3-iram4[0]);
    sol0_1_63_0 = (int)(iram0[i] * iram2[i] * iram3[0]) >> 4;
    sat0_1_63_0 = min(2047, max(-2048, sol0_1_63_0));
    if (guard1)
      sol0_0 = sol0_0_0;
    else
      sol0_0 = sat0_1_63_0;
    // i == 1,3,5,...
    sol0_1_63_1 = (int)(iram0[i+1] * iram2[i+1] * iram3[0]) >> 4;
    sol0_1 = min(2047, max(-2048, sol0_1_63_1));
    if (guard1)
      sum0 = sol0_0 + sol0_1;
    else
      sum0 += sol0_0 + sol0_1;
    guard3 = ((sum0 & 1) == 0);
    if (guard2 && guard3)
      sol0_1 ^= 1;
    iram0[i] = sol0_0;
    iram0[i+1] = sol0_1;
    // part for odd j values omitted
  }
}

[0625] FIG. 26 shows the dataflow graph of one branch of the configuration. The different
sections are colored for convenience.

5.8.5 Performance Evaluation
[0626] The next table lists the estimated performance of data transfers. The values assume that
each read causes a cache miss, i.e. that the cache does not contain any data before the first
preload occurs. The startup preloads section contains the preloads before the j-loop and the
preloads of the block data in the first iteration. The steady state preloads and write-backs,
on the other hand, describe the preloads and write-backs in the body of the j-loop. TABLE US
00188

                              Size      Cache    RAM to Cache     Cache to IRAM
Data                          [bytes]   Misses   [cache cycles]   [cache cycles]
Startup Preloads
intra_q                       64        2        112              4
mbinfo[k].mquant              4         1        56               1
dc_prec                       4         1        56               1
Sum                                              224              6
Steady State Preloads
blocks[k*block_count+j]       128       4        224              8
blocks[k*block_count+j+1]     128       4        224              8
Sum                                              448              16
Steady State Writebacks
blocks[k*block_count+j]       128                128              8
blocks[k*block_count+j+1]     128                128              8
Sum                                              256              16

[0627] The write-back of the block data causes no cache miss, because the cache line was
already loaded by the preload operation. Therefore the write-back does not include cycles for
write allocation.
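
The preload rows of the transfer table are consistent with a simple miss model. The following is a minimal sketch in C; the 32-byte cache line and the 56 cache cycles per miss are inferred from the table rows (64 bytes, 2 misses, 112 cycles) rather than stated in the text, the data are assumed to be line-aligned, and the function names are hypothetical.

```c
/* Both constants are inferred from the table, not given in the text. */
enum { CACHE_LINE_BYTES = 32, CYCLES_PER_MISS = 56 };

/* Number of cache lines touched by a line-aligned transfer. */
static unsigned cache_misses(unsigned size_bytes)
{
    return (size_bytes + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES;
}

/* RAM-to-cache cost of a preload when every line misses. */
static unsigned ram_to_cache_cycles(unsigned size_bytes)
{
    return cache_misses(size_bytes) * CYCLES_PER_MISS;
}
```

Under these assumptions a 128-byte block costs 4 misses and 224 cache cycles, and each 4-byte scalar still costs one full miss of 56 cycles.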

[0628] To compare the performance with the reference system we make some assumptions. The
cycle count of one iteration of the k-loop is measured. As stated above, block_count has a
maximum value of 12. This means that XppExecute is called 6 times in one iteration, since the
configuration works on two blocks concurrently. Thus the total cycle count is the sum of the
startup loads and 6 times the maximum of the steady state preloads and the execution cycles.

[0629] The execution cycles were measured by mapping and simulating the hand-written
_XppCfg_iquant_intra_mpeg2 configuration, where a special start object ensures that
configuration buildup and execution do not overlap. Experiments showed that it is valuable to
place distinct counters everywhere the iteration count is needed. The short connections that
can then be routed have a great impact on the execution speed. This optimization can be done
easily by a compiler. Another relatively simple optimization was done by manually placing the
most important parts of the dataflow graph.

[0630] Although this is not as simple as the previous optimization, the performance impact of
almost 100 cycles seems to make it a required feature for a compiler.

[0631] The simulation yields 110 cycles for the configuration execution, which must be doubled
to scale it to the data transfer cache cycles. A multiplication by 6 yields the final execution
cycles for one iteration of the k-loop.
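
The cycle model described in the preceding paragraphs can be written down directly. A minimal sketch in C, with hypothetical function names, in which all quantities are expressed in cache cycles:

```c
static unsigned max_u(unsigned a, unsigned b)
{
    return a > b ? a : b;
}

/* Core cycles scaled to cache cycles (factor 2, as stated in the text). */
static unsigned exec_cache_cycles(unsigned core_cycles)
{
    return 2 * core_cycles;
}

/* startup + calls * max(steady-state transfer, execution). Transfers and
 * execution overlap, so only the slower of the two counts per call. */
static unsigned total_cycles(unsigned startup, unsigned steady,
                             unsigned exec, unsigned calls)
{
    return startup + calls * max_u(steady, exec);
}
```

With 110 core cycles per execution this gives 220 cache cycles per call and 1320 for six calls. Taking a startup of 1117 and a steady-state transfer of 32 cache cycles reproduces the 2437 cycles listed for the cache level in the summary table of this section.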

[0632] The results are summarized in the following table. TABLE US 00189

                Data Access     Configuration   XPP Execute           Ref. System    Speedup
Configurations  RAM    DCache   RAM    ICache   Core   Cache  RAM     Cache  RAM     Core  Cache  RAM
startup         224    6        1960   1117            1117   2184
steady state    672    32                       220    220    672
sum             4256                            1320   2437   6216    17611  21867   13.3  7.2    3.5

[0633] This table describes the worst case. All data must be loaded from RAM. When we assume
that the configuration is loaded from cache, which is an accurate assumption because it mainly
alternates with the configuration for non-intra coded blocks, the statistics look much better.
Since the quantization matrix and the scaling constants also stay in the cache, their preloads
do not burden the cache-RAM bus either. TABLE US 00190

                Data Access     Configuration   XPP Execute           Ref. System    Speedup
Configurations  RAM    DCache   RAM    ICache   Core   Cache  RAM     Cache  RAM     Core  Cache  RAM
startup                6               1117            1117   1117
steady state    672    32                       220    220    672
sum             4032                            1320   2437   5149    17611  21643   13.3  7.2    4.2

[0634] The final utilization is shown in the following table. The big differences from the
estimated values for the BREGs and FREGs result from the distributed counters. TABLE US 00191

Parameter                     Value
Vector length                 2 * 32 (2 * 64 16-bit values)
Reused data set size          -
I/O IRAMs [sum-pct]           5 - 31%
ALU [sum-pct]                 39 - 61%
BREG [def/route/sum-pct]      39/14/53 - 66%
FREG [def/route/sum-pct]      20/16/36 - 45%

5.9 MPEG2 Codec - IDCT

[0635] The idct algorithm (inverse discrete cosine transformation) is used for the MPEG2 video
decompression algorithm. It operates on 8x8 blocks of video images in their frequency
representation and transforms them back into their original signal form. The MPEG2 decoder
contains a transform function that calls idct for all blocks of a frequency-transformed picture
to restore the original image.

[0636] The idct function consists of two for-loops. The first loop calls idctrow, the second
idctcol. Function inlining is able to eliminate the function calls within the entire loop nest
so that the numeric code is no longer interrupted by function calls. Another way to get rid of
function calls in the loop nest is loop embedding, which pushes loops from the caller into the
callee.

5.9.1 Original Code (idct.c)
[0637] TABLE US 00192

/* two dimensional inverse discrete cosine transform */
void idct(block)
short *block;
{
  int i;
  for (i=0; i<8; i++)
    idctrow(block+8*i);
  for (i=0; i<8; i++)
    idctcol(block+i);
}

[0638] The first loop changes the values of the block row by row. Afterwards the changed block
is further transformed column by column. All rows have to be finished before any column
processing can be started (see FIG. 27).

[0639] Data dependence analysis detects true data dependences between row processing and
column processing. Therefore processing of the columns has to be delayed until all rows are
done. The innermost loop bodies of idctrow and idctcol are nearly identical. They perform
numeric calculations on eight input values, column values in the case of idctcol and row values
in the case of idctrow. Eight output values are calculated and written back (as column/row).
idctcol additionally applies clipping before the values are written back. This is why we
concentrate on idctcol: TABLE US 00193

/* column (vertical) IDCT
 *
 *             7                          pi         1
 * dst[8*k] = sum c[l] * src[8*l] * cos( -- * ( k + - ) * l )
 *            l=0                         8          2
 *
 * where: c[0]    = 1/1024
 *        c[1..7] = (1/1024)*sqrt(2)
 */
static void idctcol(blk)
short *blk;
{
  int x0, x1, x2, x3, x4, x5, x6, x7, x8;
  /* shortcut */
  if (!((x1 = (blk[8*4]<<8)) | (x2 = blk[8*6]) |
        (x3 = blk[8*2]) | (x4 = blk[8*1]) | (x5 = blk[8*7]) |
        (x6 = blk[8*5]) | (x7 = blk[8*3])))
  {
    blk[8*0] = blk[8*1] = blk[8*2] = blk[8*3] = blk[8*4] = blk[8*5] =
      blk[8*6] = blk[8*7] = iclp[(blk[8*0]+32)>>6];
    return;
  }
  x0 = (blk[8*0]<<8) + 8192;
  /* first stage */
  x8 = W7*(x4+x5) + 4;
  x4 = (x8+(W1-W7)*x4)>>3;
  x5 = (x8-(W1+W7)*x5)>>3;
  x8 = W3*(x6+x7) + 4;
  x6 = (x8-(W3-W5)*x6)>>3;
  x7 = (x8-(W3+W5)*x7)>>3;
  /* second stage */
  x8 = x0 + x1;
  x0 -= x1;
  x1 = W6*(x3+x2) + 4;
  x2 = (x1-(W2+W6)*x2)>>3;
  x3 = (x1+(W2-W6)*x3)>>3;
  x1 = x4 + x6;
  x4 -= x6;
  x6 = x5 + x7;
  x5 -= x7;
  /* third stage */
  x7 = x8 + x3;
  x8 -= x3;
  x3 = x0 + x2;
  x0 -= x2;
  x2 = (181*(x4+x5)+128)>>8;
  x4 = (181*(x4-x5)+128)>>8;
  /* fourth stage */
  blk[8*0] = iclp[(x7+x1)>>14];
  blk[8*1] = iclp[(x3+x2)>>14];
  blk[8*2] = iclp[(x0+x4)>>14];
  blk[8*3] = iclp[(x8+x6)>>14];
  blk[8*4] = iclp[(x8-x6)>>14];
  blk[8*5] = iclp[(x0-x4)>>14];
  blk[8*6] = iclp[(x3-x2)>>14];
  blk[8*7] = iclp[(x7-x1)>>14];
}



[0640] W1-W7 are macros for numeric constants that are substituted by the preprocessor. Array
iclp is used for clipping the results to 8-bit values. It is fully defined by the init_idct
function before idct is called the first time: TABLE US 00194

void init_idct()
{
  int i;
  iclp = iclip+512;
  for (i= -512; i<512; i++)
    iclp[i] = (i<-256) ? -256 : ((i>255) ? 255 : i);
}

[0641] A special kind of idiom recognition, function recognition, is able to replace the
calculation of each array element by a compiler-known function that can be realized efficiently
on the XPP. If the compiler features whole-program memory aliasing analysis, it is able to
replace all uses of the iclp array with a call of the compiler-known function. Alternatively a
developer can manually replace the iclp array accesses by calls of the compiler-known
saturation function. The illustration shows a possible implementation for saturate(val,n) as an
NML schematic using two ALUs. In this case it is necessary to replace array accesses like
iclp[i] by saturate(i,256), see FIG. 28.

[0642] The /* shortcut */ code in idctcol speeds column processing up if x1 to x7 are equal to
zero. This breaks the well-formed structure of the loop nest. The if-condition is not
loop-invariant and loop unswitching cannot be applied. Nonetheless, the code after shortcut
handling is well suited for the XPP. It is possible to synthesize if-conditions for the XPP
(speculative processing of both blocks plus selection based on the condition), but this would
just waste PAEs without any performance benefit. Therefore the /* shortcut */ code in idctrow
and idctcol has to be removed manually. The code snippet below shows the inlined version of the
idctrow-loop with additional cache instructions for XPP control: TABLE US 00195

void idct(block)
short *block;
{
  int i;
  XppPreloadConfig(_XppCfg_idctrow); // Loop Invariant
  for (i=0; i<8; i++) {
    short *blk;
    int x0, x1, x2, x3, x4, x5, x6, x7, x8;
    blk = block+8*i;
    XppPreload(0, blk, 8/2); // 8 shorts = 4 ints
    XppPreloadClean(1, blk, 8/2); // IRAM1 is erased and assigned to blk
    XppExecute();
  }
  for (i=0; i<8; i++) {
    ...
  }
}
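
The replacement of the iclp lookup table by a saturation function, as described above, can be checked on the host. The following is a minimal sketch in C; the semantics chosen for saturate(val, n), clamping to the range [-n, n-1], are an assumption derived from the table contents and not a definition taken from the text, and `init_clip_table` is a hypothetical stand-in for init_idct.

```c
static short iclip[1024];
static short *iclp;

/* Table initialization following the init_idct listing. */
static void init_clip_table(void)
{
    int i;
    iclp = iclip + 512;
    for (i = -512; i < 512; i++)
        iclp[i] = (short)((i < -256) ? -256 : ((i > 255) ? 255 : i));
}

/* Assumed semantics of the compiler-known function: clamp to [-n, n-1]. */
static int saturate(int val, int n)
{
    if (val < -n)
        return -n;
    if (val > n - 1)
        return n - 1;
    return val;
}
```

Under this assumption iclp[i] and saturate(i, 256) agree for every index the table covers, so the lookup and its memory access can be dropped from the loop body.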

[0643] As the configuration of the XPP does not change during the loop execution, invariant
code motion has moved XppPreloadConfig(_XppCfg_idctrow) out of the loop.

5.9.2 Enhancing XPP Utilization

[0644] As mentioned at the beginning, idct is called for all data blocks of a video image (loop
in transform.c). This circumstance allows us to further improve the XPP utilization.

[0645] When we look at the dataflow graph of idctcol in detail we see that it forms a very deep
pipeline. _XppCfg_idctrow runs only eight times on the XPP, which means that only 64 elements
(8 times 8 elements of a column) are processed through this pipeline. Furthermore all data must
have left the pipeline before the XPP configuration can change to the _XppCfg_idctcol
configuration to go on with column processing. This means that something is still suboptimal in
the example.

Pipeline Depth
[0646] The pipeline is just too deep for processing only eight times eight rows. Filling and
flushing a deep pipeline is expensive if only little data is processed with it. First the units
at the end of the pipeline are idle and then the units at the beginning are unused (see FIG.
29).

Loop Interchange and Loop Tiling

[0647] It is profitable to use loop interchange for moving the dependences between row and
column processing to an outer level of the loop nest. The loop that calls the idct function in
transform.c on several blocks of the image has no dependence preventing loop interchange.
Therefore this loop can be moved inside the loops of column and row processing. TABLE US 00196

// transform.c
for (n=0; n<block_count; n++) {
  idct(blocks[k*block_count+n]); // block_count is 6 or 8 or 12
}
// idct.c
for (i=0; i<8; i++)
  idctrow(block+8*i);
for (i=0; i<8; i++)
  idctcol(block+i);

[0648] Now processing of rows and columns can be applied on more data by applying loop
tiling, and the fixed costs for filling and flushing the pipeline contribute less to the total
costs.

Constraints (Cache Sensitive Loop Tiling) 

[0619] The cache hierarchy has to be taken into account when we define the number of blocks
that will be processed by XppCfg_idctrow. Remember that the same blocks are needed in the
subsequent XppCfg_idctcol configuration! We have to take care that all blocks that are
processed during XppCfg_idctrow fit into the cache. Loop tiling has to be applied with respect
to the cache size so that the processed data fit into the cache for all three configurations.
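
The effect of interchange plus tiling can be sketched in plain C. This is an illustration only: the tile size of four blocks, the pass_fn type, and the demo_row and demo_col stand-ins are assumptions made for the sketch and do not reproduce the idctrow and idctcol computations.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK_ELEMS = 64, TILE_BLOCKS = 4 };

/* Hypothetical signature of a per-block pass. */
typedef void (*pass_fn)(short *block);

/* Trivial stand-ins for the row and column passes (illustration only). */
void demo_row(short *block) { block[0] += 1; }
void demo_col(short *block) { block[0] *= 2; }

/* Runs `row` over every block of a tile, then `col` over the same tile, so
   each configuration is loaded once per tile rather than once per block.
   Returns the number of configuration switches that were needed. */
size_t idct_tiled(short (*blocks)[BLOCK_ELEMS], size_t nblocks,
                  pass_fn row, pass_fn col)
{
    size_t switches = 0;
    assert(blocks != NULL);
    for (size_t base = 0; base < nblocks; base += TILE_BLOCKS) {
        size_t end = base + TILE_BLOCKS < nblocks ? base + TILE_BLOCKS : nblocks;
        switches++;                       /* switch to the row configuration */
        for (size_t b = base; b < end; b++)
            row(blocks[b]);
        switches++;                       /* switch to the column configuration */
        for (size_t b = base; b < end; b++)
            col(blocks[b]);
    }
    return switches;
}
```

For eight blocks this ordering needs four configuration switches, where a per-block ordering would need sixteen. TILE_BLOCKS is the knob that has to respect the cache size.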


5.9.3 NML Code Generation

Dataflow Graph
[0650] As idctcol is more complex due to clipping at the end of the calculations, we decided to
take idctcol as the representative loop body for a presentation of the dataflow graph.

[0651] FIG. 30 shows the dataflow graph for XppCfg_idctcol. A heuristic has to be applied to
the graph to estimate the resource needs on the XPP. In our example the heuristic produces the
following results: TABLE US 00197

              ADD,SUB   MUL   <<X, >>X   Saturate(x,n)
Ops needed    35        11    18         8

              ALUs   FREGs   BREGs
Res. avail.   64     80      80
Res. left     19     80      45
Res. used     45     0       35

Address Generation, Data Duplication and Data Layout Transformation: 

[0652] To fully synthesize the loop body, we have to face the problem of address generation for
accessing the data of four 8×8 blocks.

[0653] For idctrow and idctcol we have to access one row/column per cycle to get a fully
utilized pipeline. As the rows/columns are packed, i.e. one row/column is packed into four
words, we use 4-times data duplication (as described in the hardware section) to enable the
4-times parallel access which is needed to fetch a full row/column (eight short values) per cycle.

[0654] We use one counter per IRAM to realize address generation. The four counters are
started with different offsets as they correspond to different elements of the fetched row/column
(elements of the row/column are packed columns/rows). Therefore we implemented a counter
macro that has a configurable start, stop and increment value, and fits into the same PAE as the
IRAM. Detailed descriptions of the macros used are given in the appendix.
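
The address pattern produced by four such counters can be modelled in a few lines of C. This is a sketch, not the counter macro itself: the struct, the function names, and the concrete parameters (start offsets 0 to 3, stop 127, increment 4 for a 128-word IRAM) are assumptions chosen to match the description.

```c
#include <assert.h>

/* Hypothetical software model of a counter with start, stop and increment. */
struct counter { int next, stop, step; };

/* Stores the next address in *addr and returns 1 while the counter runs;
   returns 0 once the stop value has been passed. */
int counter_step(struct counter *c, int *addr)
{
    assert(c->step > 0);
    if (c->next > c->stop)
        return 0;
    *addr = c->next;
    c->next += c->step;
    return 1;
}

/* Fills out[k][0..3] with the four word addresses fetched in cycle k and
   returns the number of cycles for which all four counters delivered. */
int row_addresses(int out[][4], int max_cycles)
{
    struct counter c[4];
    int k;
    for (int i = 0; i < 4; i++) {
        c[i].next = i;      /* different start offset per IRAM copy */
        c[i].stop = 127;    /* last word of a 128-word IRAM */
        c[i].step = 4;      /* one packed row/column occupies four words */
    }
    for (k = 0; k < max_cycles; k++)
        for (int i = 0; i < 4; i++)
            if (!counter_step(&c[i], &out[k][i]))
                return k;
    return k;
}
```

Under these assumptions, cycle 0 reads words 0 to 3 and cycle 31 reads words 124 to 127, so 32 cycles cover the four blocks held in the duplicated IRAMs.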

[0655] The fetched row/column has to be unpacked with split macros. A split macro splits
packets of two shorts in an input stream into two separate streams. Now eight input values are
fed to the dataflow graph and eight result values (shorts) are created.
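
What a split does to one packed word can be shown in ordinary C. The helpers below are hypothetical and are not the split macro of the XPP library; in particular, the choice of placing the first element in the low half of the word is an assumption, since the text does not fix the layout.

```c
#include <assert.h>
#include <stdint.h>

/* Packs two 16-bit values into one 32-bit word.
   Assumed layout: `lo` in bits 0..15, `hi` in bits 16..31. */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return (uint32_t)(uint16_t)lo | ((uint32_t)(uint16_t)hi << 16);
}

/* Splits a packed word back into its two signed 16-bit halves. */
void split16(uint32_t word, int16_t *lo, int16_t *hi)
{
    *lo = (int16_t)(uint16_t)(word & 0xFFFFu);
    *hi = (int16_t)(uint16_t)(word >> 16);
}
```

Packing and splitting are inverse to each other, including for negative coefficients, which is what allows one configuration to hand packed results to the next.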

[0656] Address generation for writing back the results is not needed, as we connect the eight
result streams to FIFO mode IRAMs which are mapped to one continuous address range. Before
the results are written into the FIFO, packing is applied to provide packed input data for the
next configuration.

[0657] Unfortunately, this combination of reading data-duplicated IRAMs in RAM mode and
writing the results into FIFOs causes changes in the data layout of the input array. We have to
ensure that after all data processing the original data layout is recovered. For this reason we
need an extra configuration which restores the original data layout of the input array. This is
done in XppCfg_idctreorder, which also performs the saturation of idctcol to make the
configuration for idctcol a bit smaller.


[0658] FIG. 31 illustrates the data layout changes during the whole process. After applying the 
last configuration the data layout is the same as before. 

5.9.4 Architectural Parameters


[0659] The following section shows the architectural parameters used by the compiler driver.
These values are based on heuristics and may not exactly match the final results. They are just
start values for the optimization process. TABLE US 00198

XppCfg_idctrow

Parameter                Value
Vector length            4 words
Reused data set size     4 × 8 × 4 words
I/O IRAMs                4 (data duplication) + 8 (output)
ALU                      31 (dfg) + 8 (pack)
BREG                     32 (dfg) + 8 (pack) + 8 (unpack) + 4 (addr. sel.)
FREG                     0 (dfg) + 8 (pack) + 4 (unpack) + 4 (addr. sel.)
Dataflow graph width     8
Dataflow graph height    10
Configuration cycles     128/4 + 10 × 2

XppCfg_idctcol

Parameter                Value
Vector length            4 words
Reused data set size     4 × 8 × 4 words
I/O IRAMs                4 (data duplication) + 8 (output)
ALU                      31 (dfg) + 8 (pack)
BREG                     32 (dfg) + 8 (pack) + 8 (unpack) + 4 (addr. sel.)
FREG                     0 (dfg) + 8 (pack) + 4 (unpack) + 4 (addr. sel.)
Dataflow graph width     8
Dataflow graph height    10
Configuration cycles     128/4 + 10 × 2

XppCfg_idctreorder

Parameter                Value
Vector length            4 words
Reused data set size     4 × 8 × 4 words
I/O IRAMs                4 (data duplication) + 8 (output)
ALU                      16 (dfg) + 8 (pack)
BREG                     8 (dfg) + 8 (pack) + 8 (unpack) + 4 (addr. sel.)
FREG                     0 (dfg) + 8 (pack) + 4 (unpack) + 4 (addr. sel.)
Dataflow graph width     8
Dataflow graph height    2
Configuration cycles     128/4 + 2 × 2

[0660] Total estimated optimal configuration cycles (considering no routing delays and pipeline
stalls) for processing 4 blocks:

2 × (128/4 + 10 × 2) + 128/4 + 2 × 2 = 140 cycles
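
The estimate can be retraced with a short calculation. The sketch below merely restates the formula from the tables; the function names and the parameter names (words, ports, height) are labels introduced here for illustration.

```c
#include <assert.h>

/* Estimated cycles of one configuration: `words` IRAM words streamed at
   `ports` words per cycle, plus twice the dataflow graph height, as in the
   "Configuration cycles" rows of the parameter tables. */
unsigned config_cycles(unsigned words, unsigned ports, unsigned height)
{
    assert(ports > 0);
    return words / ports + height * 2;
}

/* idctrow and idctcol (height 10 each) followed by idctreorder (height 2). */
unsigned idct_total_cycles(void)
{
    return 2 * config_cycles(128, 4, 10) + config_cycles(128, 4, 2);
}
```

Each of the two deep configurations contributes 52 cycles and the reorder configuration 36, which gives the 140 cycles stated above.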

Example Source Code after Transformations 

[0661] The following sources result from applying the optimizations discussed above. As the
IRAM size is finally fixed to 128 words, we can only process 4 blocks at once. The original
source code has to be adapted to make this block size possible.

Transform 


[0662] Finally the idct function gets completely inlined in the itransform function of
transform.c. If block_count is equal to 4, and we assume that 32*4 words do not exceed the
cache size, then we can transform the example into: TABLE US 00199

/* inverse transform prediction error and add prediction */
void itransform(pred,cur,mbi,blocks)
unsigned char *pred[],*cur[];
struct mbinfo *mbi;
short blocks[][64];
{
  int i, j, i1, j1, k, n, cc, offs, lx;
  short *block, *nextblock;

  k = 0;

  for (j=0; j<height2; j+=16)
    for (i=0; i<width; i+=16)
    {
      if (block_count == 4) { // XPP execution only if block_count is 4
        XppPreloadConfig(_XppCfg_idctrow);
        // hide cache miss with preloading next 4 blocks (if not last
        nextblock = blocks[(k+1)*4];
        if (i+16 >= width) XppPreload(1, nextblock, 128);

        // do processing of actual 4 blocks
        block = blocks[k*4];
        // Input Data
        // IRAMs 0,2,4,6 = 0x55 = 0b01010101
        XppPreloadMultiple(0x55, block, 128); // this one causes a read
        // Output Data
        XppPreloadClean( 8, &block[0*16], 16);
        XppPreloadClean( 9, &block[1*16], 16);
        XppPreloadClean(10, &block[2*16], 16);
        XppPreloadClean(11, &block[3*16], 16);
        XppPreloadClean(12, &block[4*16], 16);
        XppPreloadClean(13, &block[5*16], 16);
        XppPreloadClean(14, &block[6*16], 16);
        XppPreloadClean(15, &block[7*16], 16);
        XppExecute();

        XppPreloadConfig(_XppCfg_idctcol);
        // Input Data
        // IRAMs 0,2,4,6 = 0x55 = 0b01010101
        XppPreloadMultiple(0x55, block, 128);
        // Output Data
        XppPreloadClean( 8, &block[0*16], 16);
        XppPreloadClean( 9, &block[1*16], 16);
        XppPreloadClean(10, &block[2*16], 16);
        XppPreloadClean(11, &block[3*16], 16);
        XppPreloadClean(12, &block[4*16], 16);
        XppPreloadClean(13, &block[5*16], 16);
        XppPreloadClean(14, &block[6*16], 16);
        XppPreloadClean(15, &block[7*16], 16);
        XppExecute();
        XppPreloadConfig(_XppCfg_idctreorder);
        // Input Data
        // IRAMs 0,2,4,6 = 0x55 = 0b01010101
        XppPreloadMultiple(0x55, block, 128);
        // Output Data
        XppPreloadClean( 8, &block[0*16], 16);
        XppPreloadClean( 9, &block[1*16], 16);
        XppPreloadClean(10, &block[2*16], 16);
        XppPreloadClean(11, &block[3*16], 16);
        XppPreloadClean(12, &block[4*16], 16);
        XppPreloadClean(13, &block[5*16], 16);
        XppPreloadClean(14, &block[6*16], 16);
        XppPreloadClean(15, &block[7*16], 16);
        XppExecute();



      }



      for (n=0; n<block_count; n++) {
        cc = (n<4) ? 0 : (n&1)+1; /* color component index */
        if (cc==0) {
          /* luminance */
          if ((pict_struct==FRAME_PICTURE) && mbi[k].dct_type) {
            /* field DCT */
            offs = i + ((n&1)<<3) + width*(j+((n&2)>>1));
            lx = width<<1;
          }
          else {
            /* frame DCT */
            offs = i + ((n&1)<<3) + width2*(j+((n&2)<<2));
            lx = width2;
          }
          if (pict_struct==BOTTOM_FIELD) offs += width;
        }
        else {
          i1 = (chroma_format==CHROMA444) ? i : i>>1;
          j1 = (chroma_format!=CHROMA420) ? j : j>>1;
          if ((pict_struct==FRAME_PICTURE) && mbi[k].dct_type
              && (chroma_format!=CHROMA420)) {
            /* field DCT */
            offs = i1 + (n&8) + chrom_width*(j1+((n&2)>>1));
            lx = chrom_width<<1;
          }
          else {
            /* frame DCT */
            offs = i1 + (n&8) + chrom_width2*(j1+((n&2)<<2));
            lx = chrom_width2;
          }
          if (pict_struct==BOTTOM_FIELD) offs += chrom_width;
        }
        // fallback to RISC execution if block_count != 4
        if (block_count != 4) idct(blocks[k*block_count+n]);
        else XppSync(blocks[k*block_count+n], 64/2); // ensure WB done for block
        add_pred(pred[cc]+offs,cur[cc]+offs,lx,blocks[k*block_count+n]);
      }
      k++;
    }
}



[0663] _XppCfg_idctrow TABLE US 00200
#define W1 2841 /* 2048*sqrt(2)*cos(1*pi/16) */
#define W2 2676 /* 2048*sqrt(2)*cos(2*pi/16) */
#define W3 2408 /* 2048*sqrt(2)*cos(3*pi/16) */
#define W5 1609 /* 2048*sqrt(2)*cos(5*pi/16) */
#define W6 1108 /* 2048*sqrt(2)*cos(6*pi/16) */
#define W7 565 /* 2048*sqrt(2)*cos(7*pi/16) */
/** _XppCfg_idctrow()
 * Does idct row calculation for 4 blocks
 * XPPIN: iram0,2,4,6 contains 4 blocks (data duplication)
 * XPPOUT: iram8-15 contains transposed calc. results
 */
void _XppCfg_idctrow() {
  // Input IRAMs in RAM Mode
  int iram0[128], iram2[128], iram4[128], iram6[128];
  // Output IRAMs in FIFO Mode
  int *iram8, *iram9, *iram10, *iram11, *iram12, *iram13, *iram14, *iram15;
  int r0, r1, r2, r3, r4, r5, r6, r7, r8;
  int r01, r23, r45, r67;
  // Counter offsets for parallel access
  int i0=0, i1=1, i2=2, i3=3;
  int k;

  for(k=0; k<32; k++) {
    // Data layout of input array is:
    // row0blk0, row7blk0, row0blk1, row7blk3
    // (with 4 packed columns([0,1],[2,3],[4,5],[6,7]))
    // 0 3, 28 31, 32 35, 124 127
    r01 = iram0[i0+=4]; // row element 0 and 1
    r23 = iram2[i1+=4]; // row element 2 and 3
    r45 = iram4[i2+=4]; // row element 4 and 5
    r67 = iram6[i3+=4]; // row element 6 and 7

    // Packed row elements have to be separated with _split16
    _split16(r01, r4, r0);
    _split16(r23, r7, r3);
    _split16(r45, r6, r1);
    _split16(r67, r5, r2);
    r1 = r1<<11;
    r0 = (r0<<11) + 128; /* for proper rounding in the fourth stage */

    /* first stage */
    r8 = W7*(r4+r5);
    r4 = r8 + (W1-W7)*r4;
    r5 = r8 - (W1+W7)*r5;
    r8 = W3*(r6+r7);
    r6 = r8 - (W3-W5)*r6;
    r7 = r8 - (W3+W5)*r7;



    /* second stage */
    r8 = r0 + r1;
    r0 -= r1;
    r1 = W6*(r3+r2);
    r2 = r1 - (W2+W6)*r2;
    r3 = r1 + (W2-W6)*r3;
    r1 = r4 + r6;
    r4 -= r6;
    r6 = r5 + r7;
    r5 -= r7;

    /* third stage */
    r7 = r8 + r3;
    r8 -= r3;
    r3 = r0 + r2;
    r0 -= r2;
    r2 = (181*(r4+r5)+128)>>8;
    r4 = (181*(r4-r5)+128)>>8;



    /* fourth stage */
    // _write16 does vertical packing on row element streams (columns)
    // to have horizontal packing on columns for the next configuration
    _write16(iram8, k, (r7+r1)>>8);
    _write16(iram9, k, (r3+r2)>>8);
    _write16(iram10, k, (r0+r4)>>8);
    _write16(iram11, k, (r8+r6)>>8);
    _write16(iram12, k, (r8-r6)>>8);
    _write16(iram13, k, (r0-r4)>>8);
    _write16(iram14, k, (r3-r2)>>8);
    _write16(iram15, k, (r7-r1)>>8);
  }
}



[0664] _XppCfg_idctcol TABLE US 00201

#define W1 2841 /* 2048*sqrt(2)*cos(1*pi/16) */
#define W2 2676 /* 2048*sqrt(2)*cos(2*pi/16) */
#define W3 2408 /* 2048*sqrt(2)*cos(3*pi/16) */
#define W5 1609 /* 2048*sqrt(2)*cos(5*pi/16) */
#define W6 1108 /* 2048*sqrt(2)*cos(6*pi/16) */
#define W7 565 /* 2048*sqrt(2)*cos(7*pi/16) */
/** _XppCfg_idctcol()
 * Does idct column calculation for 4 blocks
 * XPPIN: iram0,2,4,6 contains 4 blocks (data duplication)
 * XPPOUT: iram8-15 contains transposed calc. results */
void _XppCfg_idctcol() {
  // Input IRAMs in RAM Mode
  int iram0[128], iram2[128], iram4[128], iram6[128];
  // Output IRAMs in FIFO Mode
  int *iram8, *iram9, *iram10, *iram11, *iram12, *iram13, *iram14, *iram15;
  int c0, c1, c2, c3, c4, c5, c6, c7, c8;
  int c01, c23, c45, c67;
  // Counter offsets for parallel access
  int i0=0, i1=1, i2=2, i3=3;
  int k;

  for(k=0; k<32; k++) {
    // Data layout of input array is:
    // col0blk0, col0blk3, col1blk0, col7blk3
    // (with 4 packed rows([0,1],[2,3],[4,5],[6,7]))
    // 0 3, 12 15, 16 19, 124 127
    c01 = iram0[i0+=4]; // column element 0 and 1
    c23 = iram2[i1+=4]; // column element 2 and 3
    c45 = iram4[i2+=4]; // column element 4 and 5
    c67 = iram6[i3+=4]; // column element 6 and 7

    // Packed column elements have to be separated with _split16
    _split16(c01, c4, c0);
    _split16(c23, c7, c3);
    _split16(c45, c6, c1);
    _split16(c67, c5, c2);
    c1 = c1<<8;
    c0 = (c0<<8) + 8192;

    /* first stage */
    c8 = W7*(c4+c5) + 4;
    c4 = (c8+(W1-W7)*c4)>>3;
    c5 = (c8-(W1+W7)*c5)>>3;
    c8 = W3*(c6+c7) + 4;
    c6 = (c8-(W3-W5)*c6)>>3;
    c7 = (c8-(W3+W5)*c7)>>3;




    /* second stage */
    c8 = c0 + c1;
    c0 -= c1;
    c1 = W6*(c3+c2) + 4;
    c2 = (c1-(W2+W6)*c2)>>3;
    c3 = (c1+(W2-W6)*c3)>>3;
    c1 = c4 + c6;
    c4 -= c6;
    c6 = c5 + c7;
    c5 -= c7;

    /* third stage */
    c7 = c8 + c3;
    c8 -= c3;
    c3 = c0 + c2;
    c0 -= c2;
    c2 = (181*(c4+c5)+128)>>8;
    c4 = (181*(c4-c5)+128)>>8;



    /* fourth stage */
    // _write16 does vertical packing on column element streams (blocks)
    // to have horizontal packing on blocks for the next configuration
    _write16(iram8, k, (c7+c1)>>14);
    _write16(iram9, k, (c3+c2)>>14);
    _write16(iram10, k, (c0+c4)>>14);
    _write16(iram11, k, (c8+c6)>>14);
    _write16(iram12, k, (c8-c6)>>14);
    _write16(iram13, k, (c0-c4)>>14);
    _write16(iram14, k, (c3-c2)>>14);
    _write16(iram15, k, (c7-c1)>>14);
  }
}

[0665] _XppCfg_idctreorder TABLE US 00202
#define min(A,B) (((A)<=(B))?(A):(B))
#define max(A,B) (((A)>=(B))?(A):(B))
/** _XppCfg_idctreorder()
 * Saturates and restores original data layout
 * XPPIN: iram0,2,4,6 contains 4 blocks (data duplication)
 * XPPOUT: iram8-15 contains transposed calc. results */
void _XppCfg_idctreorder() {
  // Input IRAMs in RAM Mode
  int iram0[128], iram2[128], iram4[128], iram6[128];
  // Output IRAMs in FIFO Mode
  int *iram8, *iram9, *iram10, *iram11, *iram12, *iram13, *iram14, *iram15;
  int b0l, b0h, b1l, b1h, b2l, b2h, b3l, b3h;
  int b01l, b01h, b23l, b23h;
  // Counter offsets for parallel access
  int i0=0, i1=0+64, i2=1, i3=1+64;
  int k;

  for(k=0; k<32; k++) {
    // Data layout of input array is:
    // row0col0, row0col7, row1col0, row7col7
    // (with 2 packed blocks(0,1,2,3))
    // 0 1, 14 15, 16 17, 124 127
    b01l = iram0[i0+=2]; // fetch lower half of block 0 and 1
    b01h = iram2[i1+=2]; // fetch upper half of block 0 and 1
    b23l = iram4[i2+=2]; // fetch lower half of block 2 and 3
    b23h = iram6[i3+=2]; // fetch upper half of block 2 and 3

    // Packed blocks have to be separated with _split16
    _split16(b01l, b1l, b0l);
    _split16(b01h, b1h, b0h);
    _split16(b23l, b3l, b2l);
    _split16(b23h, b3h, b2h);

    // _write16 does vertical packing on block streams to
    // have horizontal packing on rows as in the original data layout
    _write16(iram8, k, min(max(b0l,-256),255));
    _write16(iram9, k, min(max(b0h,-256),255));
    _write16(iram10, k, min(max(b1l,-256),255));
    _write16(iram11, k, min(max(b1h,-256),255));
    _write16(iram12, k, min(max(b2l,-256),255));
    _write16(iram13, k, min(max(b2h,-256),255));
    _write16(iram14, k, min(max(b3l,-256),255));
    _write16(iram15, k, min(max(b3h,-256),255));
  }
}

5.9.6 Performance Evaluation


[0666] To guarantee fair conditions for this example, we have to compare the total number of
cycles the idct algorithm takes on a fixed amount of data, once on the reference system and
once on the XPP-RISC combination. As determining cycle times of single configurations for
execution on the RISC processor causes unrealistically bad results for execution on the
reference system, we decided to compare on a total-to-total basis.

Data Transfer Times 
[0667] The cycle times for data transfer are listed in the table below. It is assumed that there is
no data in the cache before executing the idct algorithm. TABLE US 00203

                                 Data   Type size   Size      Cache    RAM - Cache      Cache - IRAM
Data                             size   [bytes]     [bytes]   misses   [cache cycles]   [cache cycles]
Preloads
Input data of idctrow            128    4           512       16       896              32
Input data of idctcol            128    4           512       0        0                32
Input data of idctreorder        128    4           512       0        0                32
Sum                                                                    896              96
Writebacks
Output data of idctrow           128    4           512       0        0                32
Output data of idctcol           128    4           512       0        0                32
Output data of idctreorder       128    4           512       1        568              32
Sum                                                                    568              96

[0668] Only the first preload causes cache misses, as all other configurations operate on the
same data, and there is no need to load data from RAM. The same applies to the write-backs.
As output data created by idctrow and idctcol are only temporary and immediately consumed
by the subsequent configurations, they are never written back to RAM. Only the final output
created by idctreorder has to be written back to RAM.


[0669] Final Performance Results for the First Iteration TABLE US 00204

                 Data Access     Configuration    XPP Execute            Ref. System    Speedup
configurations   RAM    DCache   RAM     ICache   Core   Cache   RAM     Cache   RAM    Core   Cache   RAM
idctrow          896    32       10248   1464     660    1464    11144                  0.0    0.0     0.0
idctcol          0      32       10640   1513     728    1513    10640                  0.0    0.0     0.0
idctreorder      0      32       5040    714      156    714     5040                   0.0    0.0     0.0
all cfgs         896    96       25816   3688     1544   3688    26712   7860    8756   5.1    2.1     0.3

[0670] Final Performance Results for the Subsequent Iterations TABLE US 00205

                 Data Access     Configuration    XPP Execute            Ref. System    Speedup
configurations   RAM    DCache   RAM     ICache   Core   Cache   RAM     Cache   RAM    Core   Cache   RAM
idctrow          896    32                        660    660     896                    0.0    0.0     0.0
idctcol          0      32                        728    728     728                    0.0    0.0     0.0
idctreorder      0      32                        156    156     156                    0.0    0.0     0.0
all cfgs         896    96       0       0        1544   1544    1544    7860    8756   5.1    5.1     5.7

5.10 Wavelet

[0671] 5.10.1 Original Code TABLE US 00206

#define BLOCK_SIZE 16
#define COL 64
#define ROW 1

void forward_wavelet()
{
  int i, nt, *dmid;
  int *sp, *dp, d_tmp0, d_tmp1, d_tmpi, s_tmp0, s_tmp1;
  int mid, ii;
  int *x;
  int s[256], d[256];

  for (nt=COL; nt >= BLOCK_SIZE; nt>>=1) {
    for (i=0; i < nt*COL; i+=COL) { /* column loop nest */
      x = &int_data[i];
      mid = (nt >> 1) - 1;
      s[0] = x[0];
      d[0] = x[ROW];
      s[1] = x[2];
      s[mid] = x[2*mid];
      d[mid] = x[2*mid+ROW];
      d[0] = (d[0] << 1) - s[0] - s[1];
      s[0] = s[0] + (d[0] >> 2);
      d_tmp0 = d[0];
      s_tmp0 = s[1];
      for(ii=1; ii < mid; ii++) {
        s_tmp1 = x[2*ii+2];
        d_tmp1 = ((x[2*ii+ROW]) << 1) - s_tmp0 - s_tmp1;
        d[ii] = d_tmp1;
        s[ii] = s_tmp0 + ((d_tmp0 + d_tmp1) >> 3);
        d_tmp0 = d_tmp1;
        s_tmp0 = s_tmp1;
      }
      d[mid] = (d[mid] - s[mid]) << 1;
      s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);
      for(ii=0; ii <= mid; ii++) {
        x[ii] = s[ii];
        x[ii+mid+1] = d[ii];
      }
    }
    for (i=0; i < nt; i++) { /* row loop nest */
      x = &int_data[i];
      mid = (nt >> 1) - 1;
      s[0] = x[0];
      d[0] = x[COL];
      s[1] = x[COL<<1];
      s[mid] = x[(COL<<1)*mid];
      d[mid] = x[(COL<<1)*mid+COL];
      d[0] = (d[0] << 1) - s[0] - s[1];
      s[0] = s[0] + (d[0] >> 2);
      d_tmp0 = d[0];
      s_tmp0 = s[1];
      for(ii=1; ii < mid; ii++) {
        s_tmp1 = x[2*COL*(ii+1)];
        d_tmp1 = (x[2*COL*ii+COL] << 1) - s_tmp0 - s_tmp1;
        d[ii] = d_tmp1;
        s[ii] = s_tmp0 + ((d_tmp0 + d_tmp1) >> 3);
        d_tmp0 = d_tmp1;
        s_tmp0 = s_tmp1;
      }
      d[mid] = (d[mid] << 1) - (s[mid] << 1);
      s[mid] = s[mid] + ((d[mid-1] + d[mid]) >> 3);
      for(ii=0; ii <= mid; ii++) {
        x[ii*COL] = s[ii];
        x[(ii+mid+1)*COL] = d[ii];
      }
    }
  }
}


[0672] The source code exhibits a loop nest depth of three. Level 1 is an outermost loop with
induction variable nt. Level 2 consists of two inner loops with induction variable i, and level 3
is built by the four innermost loops with induction variable ii. The compiler notices by means of
value range analysis that nt will take on three values only (64, 32, and 16). As all inner loop
nest iteration counts depend on the knowledge of the value of nt, the compiler will completely
unroll the outermost loop, leaving us with six level 2 loop nests. As the unrolled source code is
relatively voluminous, we restrict the further presentation of code optimization to the case where
nt takes the value 64. The two loops of level 2 of the original source code are highly symmetric,
so we start the presentation with the first, or column loop nest, and handle differences to the
second, or row loop nest, later.
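
The result of the value range analysis can be checked by simply enumerating the outermost loop. The sketch below is an illustration of that enumeration, not of the analysis a compiler performs; the function name and the output buffer are introduced here for the example, while COL and BLOCK_SIZE carry the values of the original code.

```c
#include <assert.h>

enum { COL = 64, BLOCK_SIZE = 16 };

/* Collects the values that nt takes in
   for (nt = COL; nt >= BLOCK_SIZE; nt >>= 1)
   into out[0..cap-1] and returns how many there are. */
int outer_loop_values(int *out, int cap)
{
    int n = 0;
    assert(cap >= 0);
    for (int nt = COL; nt >= BLOCK_SIZE && n < cap; nt >>= 1)
        out[n++] = nt;
    return n;
}
```

The loop yields 64, 32 and 16. With two level 2 loops per value, complete unrolling leaves the six level 2 loop nests mentioned above.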

5.10.2 Optimizing the Column Loop Nest 
[0673] After pre-processing, application of copy propagation followed by dead code elimination
over s_tmp1 and d_tmp1, and constant propagation for nt (64) and mid (31), we obtain the
following loop nest. For readability we rename the unwieldy variable names s_tmp0 to
s0, d_tmp0 to d0, and ii to the more common index j. TABLE US 00207



for (i=0; i < 64*64; i+=64) {
  x = &int_data[i];
  s[0] = x[0];
  d[0] = x[1];
  s[1] = x[2];
  s[31] = x[62];
  d[31] = x[63];
  d[0] = (d[0] << 1) - s[0] - s[1];
  s[0] = s[0] + (d[0] >> 2);
  d0 = d[0];
  s0 = s[1];
  for (j=1; j<31; j++) {
    d[j] = ((x[2*j+1]) << 1) - s0 - x[2*j+2];
    s[j] = s0 + ((d0 + d[j]) >> 3);
    d0 = d[j];
    s0 = s[j];
  }
  d[31] = (d[31] - s[31]) << 1;
  s[31] = s[31] + ((d[30] + d[31]) >> 3);
  for (j=0; j<=31; j++) {
    x[j] = s[j];
    x[j+32] = d[j];
  }
}



[0674] FIG. 32 shows the dataflow graph of the innermost loop nest.



[0675] From the dataflow graph of the first innermost loop nest (induction variable j) the
compiler computes an optimization table. In this stage of optimization it just counts
computations and neglects the secondary effort necessary for IRAM address generation and
signal merging. If there are different possibilities to perform an operation on the XPP in this
initial stage, the compiler schedules ALU with highest priority. Inputs from or outputs to arrays
with address differences of less than 128 words (IRAM size) are always counted as coming
from the same IRAM. Hence the first innermost loop needs three input IRAMs (s0, d0, x[2*j+1]
and x[2*j+2]) and two output IRAMs (s, d). The second innermost loop needs two input IRAMs
(s, d) and one output IRAM (x[j] and x[j+32]). TABLE US 00208

Parameter                Value
Vector length            30
Reused data set size     -
I/O IRAMs                3 I + 2 O
ALU                      5
BREG                     1 (shift right by three)
FREG                     0
Dataflow graph width     2
Dataflow graph height    6
Configuration cycles     5*30 + 2

[0676] The compiler recognizes from this table that the XPP core is by far not used to capacity
by the first innermost loop. Data dependence analysis shows that the output values of the first
innermost loop are the same as the input values for the second innermost loop. Finally, the
second innermost loop has nearly the same iteration count as the first one. So the compiler tries
to merge the second innermost loop with the first one. However, data dependence analysis
shows that the fusion of the two loops is not legal without further measures, as this introduces
loop-carried anti-dependences within the x array. During iteration j=1 of the second innermost
loop, for instance, x[33] of the original x array is overwritten, while during iteration j=16 of the
first innermost loop the original value of x[33] must be available. The cache memory layout of
the XPP, however, allows a neat and cheap solution to this problem. One cache memory area
can be mapped to two different IRAMs, one for reading and one for writing. As the IRAM
filling from the cache is triggered by XppPreload commands, the read-only IRAM is filled once
before the configuration is executed. It does not interfere with the values written to the
write-only IRAM. Hence the dependence vanishes without any explicit array copying. For
correctness of the transformed source code we introduce a temporary output array t and a
(cost-free) array copy loop after the merged innermost loops. As mentioned above, the iteration
counts of the two innermost loops are not equal. Hence peeling of the first as well as of the last
iteration of the second loop is necessary. Data dependence analysis shows that the peeled code
as well as the d[31] and s[31] assignments before the second loop can be moved after the
second loop. Now the two loops are merged, leaving us with the following code: TABLE US 00209

for (i=0; i < 64*64; i+=64) {
  int t[64]; // Temporary array built by output IRAM
  x = &int_data[i];
  s[0] = x[0];
  d[0] = x[1];
  s[1] = x[2];
  s[31] = x[62];
  d[31] = x[63];
  d[0] = (d[0] << 1) - s[0] - s[1];
  s[0] = s[0] + (d[0] >> 2);
  d0 = d[0];
  s0 = s[1];
  for (j=1; j<31; j++) {
    d[j] = (x[2*j+1] << 1) - s0 - x[2*j+2];
    s[j] = s0 + ((d0 + d[j]) >> 3);
    d0 = d[j];
    s0 = s[j];
    t[j] = s[j];
    t[j+32] = d[j];
  }
  // The following array copy code is implicitly
  // done by the cache controller.
  for (j=1; j<31; j++) {
    x[j] = t[j];
    x[j+32] = t[j+32];
  }
  d[31] = (d[31] - s[31]) << 1;
  s[31] = s[31] + ((d[30] + d[31]) >> 3);
  x[0] = s[0];
  x[32] = d[0];
  x[31] = s[31];
  x[63] = d[31];
}
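
The anti-dependence that forbids a naive fusion, and the way a separate output buffer removes it, can be demonstrated on ordinary arrays. This is a reduced sketch and not the XPP code: it models only the loop for j = 1..30 and leaves out the peeled assignments for indices 0 and 31, it multiplies by 2 instead of shifting left so that negative intermediate values stay well defined, and it assumes an arithmetic right shift for negative values. All function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { N = 64, HALF = 32 };

/* Start values as in the listing: d0 = d[0] and s0 = s[1]. */
static void start_values(const int *x, int *s0, int *d0)
{
    *d0 = 2 * x[1] - x[0] - x[2];
    *s0 = x[2];
}

/* Reference: compute loop followed by a separate copy loop. */
void two_loops(int *x)
{
    int s[HALF], d[HALF], s0, d0;
    start_values(x, &s0, &d0);
    for (int j = 1; j < 31; j++) {
        d[j] = 2 * x[2*j+1] - s0 - x[2*j+2];
        s[j] = s0 + ((d0 + d[j]) >> 3);
        d0 = d[j];
        s0 = s[j];
    }
    for (int j = 1; j < 31; j++) {
        x[j] = s[j];
        x[j+HALF] = d[j];
    }
}

/* Naive fusion: x[j+32] is overwritten before a later iteration reads it. */
void fused_in_place(int *x)
{
    int s0, d0;
    start_values(x, &s0, &d0);
    for (int j = 1; j < 31; j++) {
        int d1 = 2 * x[2*j+1] - s0 - x[2*j+2];
        int s1 = s0 + ((d0 + d1) >> 3);
        d0 = d1;
        s0 = s1;
        x[j] = s1;
        x[j+HALF] = d1;
    }
}

/* Fusion with a write-only buffer t that is copied back afterwards. */
void fused_with_temp(int *x)
{
    int t[N], s0, d0;
    start_values(x, &s0, &d0);
    for (int j = 1; j < 31; j++) {
        int d1 = 2 * x[2*j+1] - s0 - x[2*j+2];
        int s1 = s0 + ((d0 + d1) >> 3);
        d0 = d1;
        s0 = s1;
        t[j] = s1;
        t[j+HALF] = d1;
    }
    for (int j = 1; j < 31; j++) {
        x[j] = t[j];
        x[j+HALF] = t[j+HALF];
    }
}
```

Run on the same input, fused_with_temp reproduces the result of two_loops, while fused_in_place diverges from iteration j=16 onward, because it reads the already overwritten x[33]. In the sketch the buffer t plays the role that the separate write-only IRAM plays on the XPP.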
[0677] Next the compiler tries to reduce IRAM usage. Data dependence analysis shows that the
values of array s which are manipulated within the innermost loop are not used outside of the
loop. d[30] is the only value which depends on values of array d calculated within the innermost
loop. Thus the compiler replaces d[30] by t[62] outside of the loop. Now it is legal for array
contraction to replace arrays s and d within the loop by scalars s1 and d1. A further IRAM
reduction is achieved by using a common IRAM for the input scalars s0 and d0 (array sd). The
tradeoff for this IRAM saving is a minor extra effort for the distribution of the two values to
their dedicated PAE locations on the XPP. We arrive at: TABLE US 00210

for (i=0; i < 64*64; i+=64) {
  int t[64]; // Temporary array built by output IRAM
  x = &int_data[i];
  s[0] = x[0];
  d[0] = x[1];
  s[1] = x[2];
  s[31] = x[62];
  d[31] = x[63];
  d[0] = (d[0] << 1) - s[0] - s[1];
  s[0] = s[0] + (d[0] >> 2);
  d0 = d[0];
  s0 = s[1];
  // The following loop is executed on the XPP.
  for (j=1; j<31; j++) {
    d1 = ((x[2*j+1]) << 1) - s0 - x[2*j+2];
    s1 = s0 + ((d0 + d1) >> 3);



    s0

    x[j] = s1;
    x[j+32] = d1;

  // The

  for (j
  }
  d[31] = (d[31] - s[31]) << 1;
  s[31] = s[31] + ((t[62] + d[31]) >> 3);
  x[0] = s[0];
  x[32] = d[0];
  x[31] = s[31];
  x[63] = d[31];



[0678] with an optimization table TABLE US 00211

Parameter                Value
Vector length            30
Reused data set size     -
I/O IRAMs                2 I + 1 O
ALU                      5
BREG                     1
FREG                     0
Dataflow graph width     2
Dataflow graph height    6
Configuration cycles     5*30 + 2

[0679] The innermost loop does not exploit the XPP to capacity. So the compiler tries to unroll 

15 the innermost loop. For the computation of the unrolling degree it is necessary to have a more 
detailed estimate of the necessary computational units, i.e. the compiler estimates the address 
computation network for the IRAMs. Array x must provide two successive array elements 
within each loop iteration. This is done by an address counter starting with address 3 and 
closing with address 62 (1 FREG, 1 BREG). The IRAM data is then distributed to two different 

20 data paths by a demultiplexer ( 1 FREG) which toggles with every incoming data packet 

between the two output lines (1 FREG, 1 BREG). The same demultiplexer plus toggle network 
is necessary for the array sd. A merger (1 FREG, 1 BREG) is used to fetch the first data packet 
from sO and all others from si. A second one merges dO and dl. Finally two counters (2 FREG, 
2 BREG) compute the storage addresses, the first starting with address 1 , and the second with 

25 address 33. The resulting data as well as the addresses are crossed by mergers which toggle 
between the two incoming packet streams (4 FREG, 2 BREG). This results in the following 
optimization table. TABLE US 00212 Parameter Value Vector length 30 Roused data sot size — 
I/O IRAMs 21 + lO ALU 5 BREG 10 FREG 13 Dataflow graph width 2 Dataflow graph height 
6 Configuration cycles 5 * 30 + 2 

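The register counts in the table can be checked by adding up the per-component figures quoted in paragraph [0679]. The following C sketch does this under one reading of the text: the demultiplexer and its toggle network are counted as separate entries, and the single BREG of the previous table is taken as the starting value. The structure and function names are hypothetical and serve only to illustrate the arithmetic.

```c
#include <assert.h>
#include <stddef.h>

/* FREG/BREG demand of one network component (illustrative type). */
struct reg_demand {
    int freg;
    int breg;
};

/* Adds the demand of n components on top of a base demand. */
struct reg_demand tally(const struct reg_demand *parts, size_t n,
                        struct reg_demand base)
{
    assert(parts != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        base.freg += parts[i].freg;
        base.breg += parts[i].breg;
    }
    return base;
}

/* Address computation network as read from paragraph [0679]. */
struct reg_demand wavelet_network_demand(void)
{
    static const struct reg_demand parts[] = {
        {1, 1}, /* address counter for array x (addresses 3..62) */
        {1, 0}, /* demultiplexer for x */
        {1, 1}, /* toggle network for the x demultiplexer */
        {1, 0}, /* demultiplexer for sd */
        {1, 1}, /* toggle network for the sd demultiplexer */
        {1, 1}, /* merger selecting s0 first, then s1 */
        {1, 1}, /* merger selecting d0 first, then d1 */
        {2, 2}, /* two counters for the storage addresses */
        {4, 2}, /* mergers crossing data and address streams */
    };
    /* Base taken from the first optimization table: FREG 0, BREG 1. */
    struct reg_demand base = {0, 1};
    return tally(parts, sizeof parts / sizeof parts[0], base);
}
```

Under these assumptions `wavelet_network_demand()` yields 13 FREGs and 10 BREGs, which matches the table.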

[0680] From the maximum number of FREGs (80) and the minimal number of FREGs per innermost
loop (13), the compiler computes an unrolling degree of 6 (=80/13, rounded down). On the other
hand, the IRAM use per innermost loop is 3 compared to 16 available IRAMs. From this, the
compiler computes an unrolling degree of 5 (=16/3, rounded down). The second innermost loop
(induction variable i) is executed 64 times. In order to avoid additional RISC code, the
iteration count should be a multiple of the unrolling degree. This finally results in an
unrolling degree of 4 and in the configuration source code listed below: TABLE US 00213

/** _XppCfg_wavelet64( )
 * Performs four innermost loops of the wavelet transformation
 * in parallel.
 * XPPIN:  iram0   s0_0, d0_0
 *         iram1   64 integers of the x array of iteration i
 *         iram2   s0_64, d0_64
 *         iram3   64 integers of the x array of iteration i+64
 *         iram4   s0_128, d0_128
 *         iram5   64 integers of the x array of iteration i+128
 *         iram6   s0_192, d0_192
 *         iram7   64 integers of the x array of iteration i+192
 * XPPOUT: iram9   64 integers of the x array of iteration i
 *         iram11  64 integers of the x array of iteration i+64
 *         iram13  64 integers of the x array of iteration i+128
 *         iram15  64 integers of the x array of iteration i+192 */
void _XppCfg_wavelet64( )
{
  int iram0[128], iram2[128], iram4[128], iram6[128];
  int iram1[128], iram3[128], iram5[128], iram7[128];
  int iram9[128], iram11[128], iram13[128], iram15[128];
  int tmp_d0_0 = iram0[0];
  int tmp_s0_0 = iram0[1];
  int tmp_d0_64 = iram2[0];
  int tmp_s0_64 = iram2[1];
  int tmp_d0_128 = iram4[0];
  int tmp_s0_128 = iram4[1];
  int tmp_d0_192 = iram6[0];
  int tmp_s0_192 = iram6[1];
  for (j=1; j<31; j++) {
    int tmp_d1_0, tmp_d1_64, tmp_d1_128, tmp_d1_192;
    int tmp_s1_0, tmp_s1_64, tmp_s1_128, tmp_s1_192;
    tmp_d1_0 = ((iram1[2*j+1]) << 1) - tmp_s0_0 - iram1[2*j+2];
    tmp_s1_0 = ((tmp_d0_0 + tmp_d1_0) >> 3) + tmp_s0_0;
    iram9[j] = tmp_s0_0 = tmp_s1_0;
    iram9[j+32] = tmp_d0_0 = tmp_d1_0;
    tmp_d1_64 = ((iram3[2*j+1]) << 1) - tmp_s0_64 - iram3[2*j+2];
    tmp_s1_64 = ((tmp_d0_64 + tmp_d1_64) >> 3) + tmp_s0_64;
    iram11[j] = tmp_s0_64 = tmp_s1_64;
    iram11[j+32] = tmp_d0_64 = tmp_d1_64;
    tmp_d1_128 = ((iram5[2*j+1]) << 1) - tmp_s0_128 - iram5[2*j+2];
    tmp_s1_128 = ((tmp_d0_128 + tmp_d1_128) >> 3) + tmp_s0_128;
    iram13[j] = tmp_s0_128 = tmp_s1_128;
    iram13[j+32] = tmp_d0_128 = tmp_d1_128;
    tmp_d1_192 = ((iram7[2*j+1]) << 1) - tmp_s0_192 - iram7[2*j+2];
    tmp_s1_192 = ((tmp_d0_192 + tmp_d1_192) >> 3) + tmp_s0_192;
    iram15[j] = tmp_s0_192 = tmp_s1_192;
    iram15[j+32] = tmp_d0_192 = tmp_d1_192;
  }
}
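The choice of the unrolling degree described in paragraph [0680] can be retraced with a few lines of C. This is only a sketch of the stated reasoning, not the compiler's actual implementation: the function name `unroll_degree` and its parameters are hypothetical, and the rule "largest degree within every resource limit that also divides the iteration count" is an interpretation of the text.

```c
#include <assert.h>

/* Largest unrolling degree one resource allows: floor(available / needed). */
static int resource_limit(int available, int needed_per_loop)
{
    assert(needed_per_loop > 0);
    return available / needed_per_loop;
}

/* Hypothetical helper: the largest degree that respects both resource
 * limits and divides the iteration count of the enclosing loop, so that
 * no remainder iterations are left for extra RISC code. */
int unroll_degree(int max_fregs, int fregs_per_loop,
                  int max_irams, int irams_per_loop,
                  int iteration_count)
{
    int by_freg = resource_limit(max_fregs, fregs_per_loop);
    int by_iram = resource_limit(max_irams, irams_per_loop);
    int degree = by_freg < by_iram ? by_freg : by_iram;

    /* Step down until the degree divides the iteration count. */
    while (degree > 1 && iteration_count % degree != 0)
        degree--;
    return degree < 1 ? 1 : degree;
}
```

With the figures of the example, `unroll_degree(80, 13, 16, 3, 64)` first obtains the limits 6 and 5, takes the smaller one, and steps down to 4 because 64 is not a multiple of 5.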

[0681] Two similar configurations handle the cases where nt=32 and nt=16. They are not shown
here as they differ only in the number of loop iterations (15 and 7, respectively).

[0682] At this point some remarks about the further translation of the configuration code to
NML code are useful. The necessary operational elements and connections are defined by the
dataflow graph of FIG. 32. But this definition is incomplete. It specifies neither which element
to place in which cell of the XPP array (placement), nor which operation to execute in which
computational unit. It is, for instance, possible to perform a subtraction in an ALU or in a
BREG. These decisions are very delicate, as they strongly influence the performance of the
generated XPP code. In the current example the following strategy is applied. The first thing to
notice is the cycle in the dataflow graph. It defines a critical path, as it determines the
minimum number of XPP cycles needed to provide a new output value. Counting along the dataflow
cycle we find five operational elements from one s1 value to the next: merge, subtract, add1,
shift right by 3, and add2. The worst case assumption is that every operational element takes
one XPP cycle. This explains the 5*30+2 configuration cycles in the optimization tables. The XPP
provides BREG elements which can be used to operate without a delay. The starting point is the
shift right by 3. This operation can be done in a BREG only. We define the NOREG property here
(0 XPP cycles). Both neighboring additions are chosen as ALU operations (2 XPP cycles). The
subtraction is done in a BREG with NOREG property (0 XPP cycles), and the merge is only possible
as FREG (1 XPP cycle). Hence we obtain a minimum of three XPP cycles per s1 value. But this
result holds only if all operational elements of the cycle can be placed within one line of the
XPP array, and within a bus section free of switch objects of the horizontal XPP buses. Hence
the compiler itself must choose the placement of this critical code section. Otherwise a severe
deterioration of the performance is inevitable.
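The cycle counting along the critical path can be illustrated with a small C sketch. It assumes the simple cost model given in the text (one XPP cycle per registered element, zero for a BREG with the NOREG property) and reuses the form of the configuration cycle estimate from the optimization tables (cycles per value times vector length, plus 2). The enumeration and function names are hypothetical, and applying the same formula to the tuned path is an extrapolation that the text does not state explicitly.

```c
#include <assert.h>
#include <stddef.h>

/* Unit kinds on the dataflow cycle; the names are illustrative only. */
enum unit_kind { UNIT_ALU, UNIT_FREG, UNIT_BREG, UNIT_BREG_NOREG };

/* One XPP cycle per registered element, none for a NOREG BREG. */
static int unit_cycles(enum unit_kind kind)
{
    return kind == UNIT_BREG_NOREG ? 0 : 1;
}

/* Cycles needed to travel once around the dataflow cycle. */
int cycles_per_value(const enum unit_kind *path, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += unit_cycles(path[i]);
    return total;
}

/* Configuration cycles in the form used by the optimization tables. */
int configuration_cycles(int per_value, int vector_length)
{
    assert(per_value >= 0 && vector_length >= 0);
    return per_value * vector_length + 2;
}

/* Order: merge, subtract, add1, shift right by 3, add2. */
const enum unit_kind worst_case_path[5] = {
    UNIT_FREG, UNIT_BREG, UNIT_ALU, UNIT_BREG, UNIT_ALU
};
const enum unit_kind tuned_path[5] = {
    UNIT_FREG, UNIT_BREG_NOREG, UNIT_ALU, UNIT_BREG_NOREG, UNIT_ALU
};
```

The worst-case path gives five cycles per value and therefore 5*30+2 configuration cycles, while the tuned path gives the three cycles per s1 value derived in the text.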

5.10.3 Optimizing the Row Loop Nest

[0683] The optimization of the row loop nest starts along the same lines as the column loop 
nest. After pre-processing, application of copy propagation followed by dead code elimination 
over s tmpl, d tmpl, and constant propagation for nt (64) and mid (31) the compiler peels off 
the first and last iteration of the second innermost loop, and moves the assignments between the
two innermost loops after the second one. TABLE US 00214

for (i=0; i < 64; i++) {
  x = &int_data[i];
  s[0] = x[0];
  d[0] = x[64];
  s[1] = x[64*2];
  s[31] = x[64*62];
  d[31] = x[64*63];
  d[0] = (d[0] << 1) - s[0] - s[1];
  s[0] = s[0] + (d[0] >> 2);
  d0 = d[0];
  s0 = s[1];
  for (j=1; j<31; j++) {
    d[j] = ((x[64*(2*j+1)]) << 1) - s0 - x[64*(2*j+2)];
    s[j] = s0 + ((d0 + d1) >> 3);
    d0 = d[j];
    s0 = s[j];
  }
  for (j=1; j<31; j++) {
    x[64*j] = s[j];
    x[64*(j+32)] = d[j];
  }
  d[31] = (d[31] << 1) - (s[31] << 1);
  s[31] = s[31] + ((x[64*62] + d[31]) >> 3);
  x[0] = s[0];
  x[32] = d[0];
  x[64*31] = s[31];
  x[64*63] = d[31];
}

[0684] Data dependence analysis computes an iteration distance of 64 for array x within the
first innermost loop. As an IRAM can store at most 128 integers, we run out of memory after the
first iteration of the innermost loop. Hence the compiler reorders the data to a new array y
before the first innermost loop. A similar problem arises with the second innermost loop, where
the compiler also introduces array y. The new array y suffers from the same array
anti-dependences as array x in the previous section. The loop fusion preventing anti-dependence
is overcome by the introduction of a temporary array t which guarantees correctness of the
transformed source code. TABLE US 00215

for (i=0; i < 64; i++) {
  int y[64], t[64];
  x = &int_data[i];
  s[0] = x[0];
  d[0] = x[64];
  s[1] = x[64*2];
  s[31] = x[64*62];
  d[31] = x[64*63];
  d[0] = (d[0] << 1) - s[0] - s[1];
  s[0] = s[0] + (d[0] >> 2);
  d0 = d[0];
  s0 = s[1];
  // Column to row transfer.
  for (j=1; j<31; j++) {
    y[2*j+1] = x[64*(2*j+1)];
    y[2*j+2] = x[64*(2*j+2)];
  }
  // The following loop is executed on the XPP.
  for (j=1; j<31; j++) {
    d[j] = ((y[2*j+1]) << 1) - s0 - y[2*j+2];
    s[j] = s0 + ((d0 + d1) >> 3);
    d0 = d[j];
    s0 = s[j];
    t[j] = s[j];
    t[j+32] = d[j];
  }
  // The following array copy code is implicitly
  // done by the cache controller.
  for (j=1; j<31; j++) {
    y[j] = t[j];
    y[j+32] = t[j+32];
  }
  // Row to column transfer.
  for (j=1; j<31; j++) {
    x[64*j] = y[j];
    x[64*(j+32)] = y[j+32];
  }
  d[31] = (d[31] << 1) - (s[31] << 1);
  s[31] = s[31] + ((x[64*62] + d[31]) >> 3);
  x[0] = s[0];
  x[32] = d[0];
  x[64*31] = s[31];
  x[64*63] = d[31];
}

[0685] After loop fusion the second innermost loop looks exactly like the loop handled in the 
previous section and can thus use the same XPP configuration. The two surrounding reordering 
loops actually perform a transposition of a column vector to a row vector and are most 
efficiently executed on the RISC. 
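The two reordering loops amount to a strided gather and scatter between a column of the 64x64 integer matrix and a contiguous vector that fits into an IRAM. The following C sketch shows that idea in isolation. It is a simplification: it copies every element of the column, whereas the listing above transfers only the elements actually needed, and the function names and the `ROW_STRIDE` constant are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Distance in integers between vertically adjacent elements of the
 * row-major 64x64 matrix used in the example. */
#define ROW_STRIDE 64

/* Gathers n elements of one matrix column into the contiguous vector y.
 * column points to the first element of the column. */
void column_to_row(const int *column, int *y, size_t n)
{
    assert(column != NULL && y != NULL);
    for (size_t j = 0; j < n; j++)
        y[j] = column[ROW_STRIDE * j];
}

/* Scatters the contiguous vector y back into the matrix column. */
void row_to_column(int *column, const int *y, size_t n)
{
    assert(column != NULL && y != NULL);
    for (size_t j = 0; j < n; j++)
        column[ROW_STRIDE * j] = y[j];
}
```

After the gather, the data dependence distance within y is 1 instead of 64, which is why the loop body can then be treated like the column loop nest of the previous section.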

5.10.4 Final Code

[0686] The outermost loop is completely unrolled, which produces six inner loop nests
(induction variable i). Each of these inner loops is unrolled four times with the wavelet XPP
configuration in the center. The unrolling of the inner loops requires a bundle of new local
variables whose names are suffixed by the original iteration numbers. Array variables with
constant array indices are replaced by scalar variables for readability reasons. s[0], for
instance, becomes s0_0, s0_64, s0_128, s0_192.

[0687] One further loop transformation is necessary to facilitate the work of the cache
controller. When the wavelet configuration finishes, a computation result in array x of each
iteration i is used in the succeeding RISC code. Hence an XppSync operation is necessary after
each XppExecute, which forces a write-back of the IRAM contents to the first level cache. The
RISC must wait until the write-back finishes. However, if the compiler splits the loop after
XppExecute, it is possible to prepare the RISC data for the next configuration during the
write-back operation of the cache controller (pipelining effect). The cost for the loop
distribution is the expansion of some scalar variables, i.e. all scalars which are computed
before and used after XppExecute must be expanded to array variables. Hence variable s0_0, for
instance, becomes s0_0[16].

[0688] Loop distribution is applicable to both the column and the row loop nest. However, in
the case of the row loop nest this requires an array for each vector element of y, i.e. y
actually becomes a matrix. In order to reduce the memory demand, the compiler does not perform
a complete loop distribution; it rather executes the two loops shifted by a memory requirement
factor. This loop optimization is called shifted loop merging (or shifted loop fusion) [7]. The
memory requirement factor is chosen to a value of four as the architecture provides three IRAM
shadows.
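Shifted loop merging can be illustrated independently of the wavelet code. In the C sketch below, a first loop body produces a value per iteration and a second loop body consumes it SHIFT iterations later, so only SHIFT slots of intermediate storage are needed instead of one per iteration. The stand-in loop bodies (doubling and incrementing), the function name and the ring buffer are assumptions made for illustration. In this sketch the delayed stage runs first within each merged iteration, so that the slot it reads has not yet been overwritten by the producing stage.

```c
#include <assert.h>
#include <stddef.h>

/* Memory requirement factor: number of iterations by which the second
 * loop body lags behind the first one. */
#define SHIFT 4

/* Runs two dependent loop bodies as one merged loop. Returns the number
 * of executions of the delayed (second) body. */
size_t shifted_loop_merge(const int *in, int *out, size_t n)
{
    int slot[SHIFT];   /* expanded scalar: SHIFT entries instead of n */
    size_t consumed = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n + SHIFT; i++) {
        if (i >= SHIFT) {
            /* Second loop body, delayed by SHIFT iterations. */
            out[i - SHIFT] = slot[(i - SHIFT) % SHIFT] + 1;
            consumed++;
        }
        if (i < n) {
            /* First loop body (stand-in for preload and execute). */
            slot[i % SHIFT] = in[i] * 2;
        }
    }
    return consumed;
}
```

The merged loop runs n + SHIFT times, which mirrors the loop bounds of the form 64+16 in the final row loop nest, and the guards `i < n` and `i >= SHIFT` play the role of the two if statements there.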

[0689] As the final code is voluminous because of successive loop unrolling, we present the
optimized RISC code for nt=64 only. TABLE US 00216



void forward_wavelet( )
{
  int i, j, k;
  int s0_0[4], s31_0[4], s1_0;
  int s0_64[4], s31_64[4], s1_64;
  int s0_128[4], s31_128[4], s1_128;
  int s0_192[4], s31_192[4], s1_192;
  int d0_0[4], d31_0[4];
  int d0_64[4], d31_64[4];
  int d0_128[4], d31_128[4];
  int d0_192[4], d31_192[4];
  int sd_0[2], sd_64[2], sd_128[2], sd_192[2];
  int y_0[64][4], y_64[64][4], y_128[64][4], y_192[64][4];

  for (i=0; i < 16*256; i+=256) { /* nt=64, column loop */
    if (i < 16*256) { /* XppPreload and XppExecute */
      XppPreloadConfig(_XppCfg_wavelet64);
      k = (i / 256) % 4;
      x = &int_data[i];
      s0_0[k] = x[0];
      d0_0[k] = x[1];
      s1_0 = x[2];
      s31_0[k] = x[62];
      d31_0[k] = x[63];
      sd_0[0] = d0_0[k] = (d0_0[k] << 1) - s0_0[k] - s1_0;
      sd_0[1] = s0_0[k] = (d0_0[k] >> 2) + s0_0[k];
      XppPreload (0, sd_0, 2);
      XppPreload (1, x, 64);
      XppPreloadClean (9, x, 64);
      x = &int_data[i+64];
      s0_64[k] = x[0];
      d0_64[k] = x[1];
      s1_64 = x[2];
      s31_64[k] = x[62];
      d31_64[k] = x[63];
      sd_64[0] = d0_64[k] = (d0_64[k] << 1) - s0_64[k] - s1_64;
      sd_64[1] = s0_64[k] = (d0_64[k] >> 2) + s0_64[k];
      XppPreload (2, sd_64, 2);
      XppPreload (3, x, 64);
      XppPreloadClean (11, x, 64);
      x = &int_data[i+128];
      s0_128[k] = x[0];
      d0_128[k] = x[1];
      s1_128 = x[2];
      s31_128[k] = x[62];
      d31_128[k] = x[63];
      sd_128[0] = d0_128[k] = (d0_128[k] << 1) - s0_128[k] - s1_128;
      sd_128[1] = s0_128[k] = (d0_128[k] >> 2) + s0_128[k];
      XppPreload (4, sd_128, 2);
      XppPreload (5, x, 64);
      XppPreloadClean (13, x, 64);
      x = &int_data[i+192];
      s0_192[k] = x[0];
      d0_192[k] = x[1];
      s1_192 = x[2];
      s31_192[k] = x[62];
      d31_192[k] = x[63];
      sd_192[0] = d0_192[k] = (d0_192[k] << 1) - s0_192[k] - s1_192;
      sd_192[1] = s0_192[k] = (d0_192[k] >> 2) + s0_192[k];
      XppPreload (6, sd_192, 2);
      XppPreload (7, x, 64);
      XppPreloadClean (15, x, 64);
      XppExecute( );
    } /* i < 16*256 */

    if (i >= 4*256) { /* delayed XppSync */
      k = (i - 4*256) % 4;
      x = &int_data[i-4*256];
      XppSync(x, 64);
      d31_0[k] = (d31_0[k] - s31_0[k]) << 1;
      s31_0[k] = s31_0[k] + ((x[62] + d31_0[k]) >> 3);
      x[0] = s0_0[k];
      x[32] = d0_0[k];
      x[31] = s31_0[k];
      x[63] = d31_0[k];
      x = &int_data[i-4*256+64];
      XppSync(x, 64);
      d31_64[k] = (d31_64[k] - s31_64[k]) << 1;
      s31_64[k] = s31_64[k] + ((x[62] + d31_64[k]) >> 3);
      x[0] = s0_64[k];
      x[32] = d0_64[k];
      x[31] = s31_64[k];
      x[63] = d31_64[k];
      x = &int_data[i-4*256+128];
      XppSync(x, 64);
      d31_128[k] = (d31_128[k] - s31_128[k]) << 1;
      s31_128[k] = s31_128[k] + ((x[62] + d31_128[k]) >> 3);
      x[0] = s0_128[k];
      x[32] = d0_128[k];
      x[31] = s31_128[k];
      x[63] = d31_128[k];
      x = &int_data[i-4*256+192];
      XppSync(x, 64);
      d31_192[k] = (d31_192[k] - s31_192[k]) << 1;
      s31_192[k] = s31_192[k] + ((x[62] + d31_192[k]) >> 3);
      x[0] = s0_192[k];
      x[32] = d0_192[k];
      x[31] = s31_192[k];
      x[63] = d31_192[k];
    } /* i >= 4*256 */
  }

  for (i=0; i < 64+16; i+=4) { /* nt=64, row loop */
    if (i < 64) { /* XppPreload and XppExecute */
      XppPreloadConfig(_XppCfg_wavelet64);
      k = (i / 4) % 4;
      x = &int_data[i];
      s0_0[k] = x[0];
      d0_0[k] = x[64];
      s1_0 = x[128];
      s31_0[k] = x[3968];
      d31_0[k] = x[4032];
      sd_0[0] = d0_0[k%4] = (d0_0[k] << 1) - s0_0[k] - s1_0;
      sd_0[1] = s0_0[k] = (d0_0[k] >> 2) + s0_0[k];
      for (j=1; j<31; j++) {
        y_0[2*j+1][k] = x[64+128*j];
        y_0[2*j+2][k] = x[128+128*j];
      }
      XppPreload (0, sd_0, 2);
      XppPreload (1, y_0[k], 64);
      XppPreloadClean (9, y_0[k], 64);
      x = &int_data[i+1];
      s0_64[k] = x[0];
      d0_64[k] = x[64];
      s1_64 = x[128];
      s31_64[k] = x[3968];
      d31_64[k] = x[4032];
      sd_64[0] = d0_64[k] = (d0_64[k] << 1) - s0_64[k] - s1_64;
      sd_64[1] = s0_64[k] = (d0_64[k] >> 2) + s0_64[k];
      for (j=1; j<31; j++) {
        y_64[2*j+1][k] = x[64+128*j];
        y_64[2*j+2][k] = x[128+128*j];
      }
      XppPreload (2, sd_64, 2);
      XppPreload (3, y_64[k], 64);
      XppPreloadClean (11, y_64[k], 64);
      x = &int_data[i+2];
      s0_128[k] = x[0];
      d0_128[k] = x[64];
      s1_128 = x[128];
      s31_128[k] = x[3968];
      d31_128[k] = x[4032];
      sd_128[0] = d0_128[k] = (d0_128[k] << 1) - s0_128[k] - s1_128;
      sd_128[1] = s0_128[k] = (d0_128[k] >> 2) + s0_128[k];
      for (j=1; j<31; j++) {
        y_128[2*j+1][k] = x[64+128*j];
        y_128[2*j+2][k] = x[128+128*j];
      }
      XppPreload (4, sd_128, 2);
      XppPreload (5, y_128[k], 64);
      XppPreloadClean (13, y_128[k], 64);
      x = &int_data[i+3];
      s0_192[k] = x[0];
      d0_192[k] = x[64];
      s1_192 = x[128];
      s31_192[k] = x[3968];
      d31_192[k] = x[4032];
      sd_192[0] = d0_192[k] = (d0_192[k] << 1) - s0_192[k] - s1_192;
      sd_192[1] = s0_192[k] = (d0_192[k] >> 2) + s0_192[k];
      for (j=1; j<31; j++) {
        y_192[2*j+1][k] = x[64+128*j];
        y_192[2*j+2][k] = x[128+128*j];
      }
      XppPreload (6, sd_192, 2);
      XppPreload (7, y_192[k], 64);
      XppPreloadClean (15, y_192[k], 64);
      XppExecute( );
    } /* i < 64 */

    if (i >= 16) { /* delayed XppSync */
      k = (i-16)%4;
      x = &int_data[i-16];
      XppSync(y_0[k], 64);
      for (j=1; j<31; j++) {
        x[64*j] = y_0[j][k];
        x[2048+64*j] = y_0[j+32][k];
      }
      d31_0[k] = (d31_0[k] << 1) - (s31_0[k] << 1);
      s31_0[k] = s31_0[k] + ((x[3968] + d31_0[k]) >> 3);
      x[0] = s0_0[k];
      x[2048] = d0_0[k];
      x[1984] = s31_0[k];
      x[4032] = d31_0[k];
      x = &int_data[i-16+1];
      XppSync(y_64[k], 64);
      for (j=1; j<31; j++) {
        x[64*j] = y_64[j][k];
        x[2048+64*j] = y_64[j+32][k];
      }
      d31_64[k] = (d31_64[k] << 1) - (s31_64[k] << 1);
      s31_64[k] = s31_64[k] + ((x[3968] + d31_64[k]) >> 3);
      x[0] = s0_64[k];
      x[2048] = d0_64[k];
      x[1984] = s31_64[k];
      x[4032] = d31_64[k];
      x = &int_data[i-16+2];
      XppSync(y_128[k], 64);
      for (j=1; j<31; j++) {
        x[64*j] = y_128[j][k];
        x[2048+64*j] = y_128[j+32][k];
      }
      d31_128[k] = (d31_128[k] << 1) - (s31_128[k] << 1);
      s31_128[k] = s31_128[k] + ((x[3968] + d31_128[k]) >> 3);
      x[0] = s0_128[k];
      x[2048] = d0_128[k];
      x[1984] = s31_128[k];
      x[4032] = d31_128[k];
      x = &int_data[i-16+3];
      XppSync(y_192[k], 64);
      for (j=1; j<31; j++) {
        x[64*j] = y_192[j][k];
        x[2048+64*j] = y_192[j+32][k];
      }
      d31_192[k] = (d31_192[k] << 1) - (s31_192[k] << 1);
      s31_192[k] = s31_192[k] + ((x[3968] + d31_192[k]) >> 3);
      x[0] = s0_192[k];
      x[2048] = d0_192[k];
      x[1984] = s31_192[k];
      x[4032] = d31_192[k];
    } /* i >= 16 */
  }
  /* nt=32, column loop */
  ...
  /* nt=32, row loop */
  ...
  /* nt=16, column loop */
  ...
  /* nt=16, row loop */
  ...
}

5.10.5 Performance Evaluation

[0690] The performance evaluation of this example is based on the assumption that the code
optimizations done for the XPP are also useful for the reference processor. Hence we compare
the code executed within each configuration only. But this argumentation is not entirely correct
for the current example, as the compiler applied a column to row transposition (and vice versa)
for the row loop nest because of the restricted IRAM size. This optimization is not meaningful
for the reference processor. This is why we correct the reference system performance values by
subtracting the cycles necessary for the transposition.

[0691] The data transfer performance for the configuration _XppCfg_wavelet64 as part of the
column loop nest is listed in the following table. It is assumed that there is no data in the
cache (startup case). TABLE US 00217

Data              Size      Cache     RAM to Cache     Cache to IRAM
                  [bytes]   Misses    [cache cycles]   [cache cycles]
Preloads
sd_0                  8       1            56               1
int_data            256       8           448              16
sd_64                 8       1            56               1
int_data + 64       256       8           448              16
sd_128                8       1            56               1
int_data + 128      256       8           448              16
sd_0                  8       1            56               1
int_data + 192      256       8           448              16
Sum                                      2016              68
Writebacks
int_data            256       0           256              16
int_data + 64       256       0           256              16
int_data + 128      256       0           256              16
int_data + 192      256       0           256              16
Sum                                      1024              64



[0692] The write-back of array int_data causes no cache miss, because the relevant array sector
is already in the cache (loaded by the corresponding preload operations). Therefore the
write-back does not include cycles for write allocation. The row Sum gives the total number of
cycles for the first execution of the whole _XppCfg_wavelet64 configuration.

[0693] This configuration is invoked 16 times on different sectors of array int_data. Hence the
cache miss situation for array int_data is identical in each iteration. No cache miss, however,
is produced by accesses to the arrays sd as these are already in the cache. After the 16
iterations the whole array int_data is loaded into the first level cache. The following table
summarizes the data transfer cycles for the remaining 15 iterations (steady state case).
TABLE US 00218

Data              RAM to Cache     Cache to IRAM
                  [cache cycles]   [cache cycles]
Preloads
Sum                   1792              68
Writebacks
Sum                   1024              64
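The preload sums in the two tables follow from simple per-transfer costs that can be read off the table rows: each cache miss costs 56 cache cycles (8 misses give 448), and the cache to IRAM transfer moves 16 bytes per cache cycle (256 bytes take 16 cycles, 8 bytes take 1). The C sketch below recomputes the sums from these figures. The two constants are inferred from the table values rather than stated in the text, and all names are hypothetical.

```c
#include <assert.h>

enum {
    MISS_PENALTY_CYCLES  = 56, /* inferred: 1 miss -> 56, 8 misses -> 448 */
    IRAM_BYTES_PER_CYCLE = 16, /* inferred: 256 bytes -> 16 cycles */
    UNROLL_DEGREE        = 4   /* four sd arrays and four x sectors */
};

/* RAM to cache cost of one preload with the given number of misses. */
int ram_to_cache_cycles(int cache_misses)
{
    assert(cache_misses >= 0);
    return cache_misses * MISS_PENALTY_CYCLES;
}

/* Cache to IRAM cost of one preload, rounded up to whole cycles. */
int cache_to_iram_cycles(int bytes)
{
    assert(bytes >= 0);
    return (bytes + IRAM_BYTES_PER_CYCLE - 1) / IRAM_BYTES_PER_CYCLE;
}

/* Preload sums over all four unrolled loop bodies. */
int preload_ram_sum(int sd_misses, int data_misses)
{
    return UNROLL_DEGREE *
           (ram_to_cache_cycles(sd_misses) + ram_to_cache_cycles(data_misses));
}

int preload_iram_sum(int sd_bytes, int data_bytes)
{
    return UNROLL_DEGREE *
           (cache_to_iram_cycles(sd_bytes) + cache_to_iram_cycles(data_bytes));
}
```

In the startup case both the sd arrays and the int_data sectors miss, giving 4 * (56 + 448) = 2016 cycles. In the steady state the sd arrays hit, which leaves 4 * 448 = 1792 cycles.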

[0694] The configurations _XppCfg_wavelet32 and _XppCfg_wavelet16 as part of the column loop
nest access the same arrays but with smaller data sizes. Hence there is no cache miss at all.
The following tables summarize the data transfer cycles for the _XppCfg_wavelet32 and
_XppCfg_wavelet16 configurations as part of the column loop nest (startup case = steady state
case). TABLE US 00219

_XppCfg_wavelet32:

Data              RAM to Cache     Cache to IRAM
                  [cache cycles]   [cache cycles]
Preloads
Sum                      0              36
Writebacks
Sum                    512              32

_XppCfg_wavelet16:

Data              RAM to Cache     Cache to IRAM
                  [cache cycles]   [cache cycles]
Preloads
Sum                      0              20
Writebacks
Sum                    256              16

[0695] The data transfer performance for configuration _XppCfg_wavelet64 as part of the row
loop nest is listed in the following table (startup case). TABLE US 00220

Data              Size      Cache     RAM to Cache     Cache to IRAM
                  [bytes]   Misses    [cache cycles]   [cache cycles]
Preloads
sd_0                  8       0             0               1
y_0[k]              256       8           448              16
sd_64                 8       0             0               1
y_64[k]             256       8           448              16
sd_128                8       0             0               1
y_128[k]            256       8           448              16
sd_0                  8       0             0               1
y_192[k]            256       8           448              16
Sum                                      1792              68
Writebacks
y_0[k]              256       0           256              16
y_64[k]             256       0           256              16
y_128[k]            256       0           256              16
y_192[k]            256       0           256              16
Sum                                      1024              64

[0696] Here the situation is a bit more complicated. The table is valid for the first four
iterations, as k loops from zero to three, which produces cache misses for the y arrays. After
4 iterations all y arrays are in the cache and no further cache miss occurs. Hence the next
table shows the cycles for iterations 5 to 16 (steady state case). TABLE US 00221

Data              RAM to Cache     Cache to IRAM
                  [cache cycles]   [cache cycles]
Preloads
Sum                      0              68
Writebacks
Sum                   1024              64

[0697] The configurations _XppCfg_wavelet32 and _XppCfg_wavelet16 as part of the row loop nest
have the same data transfer performance as if they were used as part of the column loop nest.
Again, this is due to the fact that no cache miss occurs.

[0698] The basis for the comparison is the hand-written NML source codes wavelet64.nml,
wavelet32.nml and wavelet16.nml, which implement the configurations _XppCfg_wavelet64,
_XppCfg_wavelet32 and _XppCfg_wavelet16, respectively. Note that these configurations are
completely placed by hand in order to obtain a clearly arranged cell structure for debugging
reasons. It is, however, possible to automatically place most modules without a significant
decrease in performance. The only exception is the LOOP module, the contents of which must be
placed by the compiler itself (see section 5.10.2).

[0699] The following two performance tables present the overall results. The first table shows
the startup case where neither data nor configurations are preloaded in the cache. As
configuration loading is extremely expensive, it dominates all figures and guarantees a poor
performance. The second table presents the steady state case after a (theoretically) infinite
number of iterations. Now a data preload followed by a write-back are done during the
execution of a configuration. However, we constantly work at new sections of array int_data.
This is why we have a steady load from RAM to the cache and a write from the cache to RAM.
This memory bottleneck degrades the overall speedup to a factor of 1.6. On the assumption
that array int_data is handled several times by the forward_wavelet function, the whole data
remains in the cache and the speedup increases to the considerable factor of 3.9. The
example demonstrates that only loop bodies with a considerable amount of computations
promise a considerable performance gain. Pure data shuffling applications suffer with the XPP
from the same memory limitations as the RISC host processor. TABLE US 00222

Startup case:

                         Data Access    Configuration   XPP Execute           Ref. System    Speedup
configurations           RAM    DCache  RAM     ICache  Core   Cache   RAM    Cache   RAM    Core  Cache  RAM
wavelet64 (column nest)  2016   68      7728    1100    212    1100    9744   1020    3036   4.8   0.9    0.3
wavelet64 (row nest)     1792   68      0       0       212    212     1792   804     2596   3.8   3.8    1.4
wavelet32 (column nest)  0      36      7728    1100    116    1100    7728   492     492    4.2   0.4    0.1
wavelet32 (row nest)     0      36      0       0       116    116     116    388     388    3.3   3.3    3.3
wavelet16 (column nest)  0      20      7728    1100    68     1100    7728   228     228    3.4   0.2    0.0
wavelet16 (row nest)     0      20      0       0       68     68      68     180     180    2.6   2.6    2.6
all configurations       3808   248     23128   3300    792    3300    26936  3112    6920   3.9   0.9    0.3

Steady state case:

                         Data Access    Configuration   XPP Execute           Ref. System    Speedup
configurations           RAM    DCache  RAM     ICache  Core   Cache   RAM    Cache   RAM    Core  Cache  RAM
wavelet64 (column nest)  2816   68      -       -       212    212     2816   1020    3836   4.8   4.8    1.4
wavelet64 (row nest)     1024   68      -       -       212    212     1024   804     1828   3.8   3.8    1.8
wavelet32 (column nest)  512    36      -       -       116    116     512    492     1004   4.2   4.2    2.0
wavelet32 (row nest)     512    36      -       -       116    116     512    388     900    3.3   3.3    1.8
wavelet16 (column nest)  256    20      -       -       68     68      256    228     484    3.4   3.4    1.9
wavelet16 (row nest)     256    20      -       -       68     68      256    180     436    2.6   2.6    1.7
all configurations       5376   248     -       -       792    792     5376   3112    8488   3.9   3.9    1.6
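The speedup columns are the ratio of the reference system cycles to the XPP cycles at the same memory level. The XPP Execute columns appear to follow a simple rule that reproduces every row of both tables: the Cache figure is the largest of the DCache, ICache and Core cycles, and the RAM figure is the larger of that value and the sum of the data and configuration RAM cycles. This rule is inferred from the table values and is not stated in the text. The C sketch below encodes it with hypothetical function names.

```c
#include <assert.h>

static int max2(int a, int b)
{
    return a > b ? a : b;
}

/* Inferred rule: at cache level the slowest of the three overlapping
 * activities determines the XPP execution time. */
int xpp_cache_cycles(int dcache, int icache, int core)
{
    return max2(max2(dcache, icache), core);
}

/* Inferred rule: at RAM level the data and configuration loads add up,
 * but the result is never below the cache level figure. */
int xpp_ram_cycles(int data_ram, int config_ram, int cache_cycles)
{
    return max2(data_ram + config_ram, cache_cycles);
}

/* Speedup of the XPP over the reference system at one memory level. */
double speedup(int ref_cycles, int xpp_cycles)
{
    assert(xpp_cycles > 0);
    return (double)ref_cycles / (double)xpp_cycles;
}
```

For the steady state totals, for instance, `speedup(3112, 792)` is about 3.9 and `speedup(8488, 5376)` is about 1.6, the two factors quoted in paragraph [0699].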

[0700] The utilization of the _XppCfg_wavelet configurations shows that the XPP capacity is
mostly used for memory (wavelet64.nml, wavelet32.nml, wavelet16.nml). The information is
taken from the "info" files generated from the NML source code by the GAP tool. TABLE US
00223

Parameter                        Value
Vector length                    30 (14, 6) 32-bit values
Reused data set size             -
I/O IRAMs [sum-pct]              12 - 75%
ALU [sum-pct]                    12 - 19%
BREG [def/route/sum-pct]         37/5/42 - 66%
FREG [def/route/sum-pct]         40/2/42 - 66%

5.11 Conclusion

[0701] The theoretical results did not scale well to real world results. The biggest single
performance loss was experienced during placement and routing. This on one hand
demonstrates the potential of the architecture, but on the other hand also shows current
limitations of the architecture as well as of the tools.

[0702] The following proposals may help to narrow the gap between theoretical and practical
performance:

5.11.1 RAM Bus Width

[0703] A bus width of more than 32 bits is better suited to such a highly parallel architecture.

5.11.2 Use of the Cache Instead of Separate IRAMs

[0704] As the utilization of the shadow IRAMs is less than the utilization of the cache, the
second design without dedicated IRAM memory is more silicon efficient, also eliminating the
cache-IRAM transfer cycles.

5.11.3 Configuration Size

[0705] The configuration bus is narrow compared to the average configuration size. The same is
true for the instruction cache. The replicated structure of the array allows for a highly parallel
reconfiguration bus from the instruction cache. A 128 bit bus can be split into eight 16 bit
configuration busses to each line of the array.
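As a small illustration of the proposed bus split, the following C sketch cuts one 128 bit configuration word into eight 16 bit lanes, one per line of the array. The representation of the 128 bit word as two 64 bit halves, the lane ordering (lane 0 taken from the least significant bits) and all names are assumptions made for this sketch; the text does not define how the lanes map to array lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CONFIG_LANES = 8, LANE_BITS = 16 };

/* A 128 bit configuration word held as two 64 bit halves. */
struct config_word128 {
    uint64_t lo; /* bits 0..63 */
    uint64_t hi; /* bits 64..127 */
};

/* Splits the word into eight 16 bit lanes, lane 0 being bits 0..15. */
void split_config_word(struct config_word128 word,
                       uint16_t lanes[CONFIG_LANES])
{
    assert(lanes != NULL);
    for (int i = 0; i < CONFIG_LANES; i++) {
        uint64_t half = (i < CONFIG_LANES / 2) ? word.lo : word.hi;
        int shift = LANE_BITS * (i % (CONFIG_LANES / 2));
        lanes[i] = (uint16_t)(half >> shift);
    }
}
```

Each lane could then feed the configuration bus of one array line in the same cycle, which is the parallelism the proposal aims at.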

5.11.4 ALU/FREG/BREG Orthogonality

[0706] The NOREG feature is limited to BREGs. Only one BREG in a sequence can be in
unregistered mode. This way it is possible to save cycles in a backend post-optimization, if the
BREGs can be set to unregistered mode. The number of saved cycles depends on the type and
order of operations. This feature is non-orthogonal and makes it hard for the compiler to estimate
the actual number of cycles needed.

[0707] The current specialization of the forward and backward units, together with the delays on
the busses, interacts badly with placement and routing. The type and sequence of the
operations determine the direction of the computational flow:

[0708] If FREGs and BREGs can be used alternately, the computation propagates values
along the line of the PAE array. All BREGs can be set to unregistered mode, saving half of the
cycles.

[0709] If FREGs and ALUs are used in line, the computational flow either follows the column
downward or the line in the array. For the latter mode, NOREG BREGs must be used.

[0710] If only BREGs are needed sequentially, the computational flow follows the column in 
upward direction. As at least every second BREG in line must be in registered mode, half of the 
cycles can be saved. 

[0711] If a PAE consisted of a forward ALU, a forward REG, a backward ALU and a backward
REG, this orthogonality would have positive effects on the freedom of placement and routing.

5.11.5 Placement and Routing Improvements

[0712] If placement and routing of the critical path is done first, followed by the placement and
routing of the less critical components, fewer registers will be inserted into the critical path by the
router. In general, several different heuristics should be used in placement and routing.

[0713] Feedback from the placement and routing tool to the compiler can help avoid the added 
registers in the critical path. 

[0714] NML currently does not cover specification of the bus switch elements. There is no way
to control the register property of the switches. Control of this feature would enable efficient
control of bus delays with feedback-directed compilation.
appear.

[0728] The invention will now be described further and/or in other details by the following part 
of the description entitled "A Method for Compiling High Level Language Programs to a 
Reconfigurable Data-Flow Processor".

1 Introduction 

[0729] This document describes a method for compiling a subset of a high-level programming 
language (HLL) like C or FORTRAN, extended by port access functions, to a reconfigurable 
data-flow processor (RDFP) as described in Section 3. The program is transformed to a
configuration of the RDFP. 

[0730] This method can be used as part of an extended compiler for a hybrid architecture
consisting of a standard host processor and a reconfigurable data-flow coprocessor. The extended
compiler handles a full HLL like standard ANSI C. It maps suitable program parts like inner
loops to the coprocessor and the rest of the program to the host processor. It is also possible to
map separate program parts to separate configurations. However, these extensions are not
the subject of this document.

2 Compilation Flow

[0731] This section briefly describes the phases of the compilation method.

2.1 Frontend

[0732] The compiler uses a standard frontend which translates the input program (e.g. a C
program) into an internal format consisting of an abstract syntax tree (AST) and symbol tables.
The frontend also performs well-known compiler optimizations such as constant propagation, dead
code elimination, common subexpression elimination, etc. For details, refer to any compiler
construction textbook like [1]. The SUIF compiler [2] is an example of a compiler providing
such a frontend.

2.2 Control/Dataflow Graph Generation 

[0733] Next, the program is mapped to a control/dataflow graph (CDFG) consisting of
connected RDFP functions. This phase is the main subject of this document and is presented in
Section 4.

2.3 Configuration Code Generation

[0734] Finally, the last phase directly translates the CDFG to configuration code used to
program the RDFP. For PACT XPP™ Cores, the configuration code is generated as an NML
(Native Mapping Language) file.
3 Configurable Objects and Functionality of an RDFP

[0735] This section describes the configurable objects and functionality of an RDFP. A possible
implementation of the RDFP architecture is a PACT XPP™ Core. Here we only describe the
minimum requirements for an RDFP for this compilation method to work. The only data types
considered are multi-bit words called data and single-bit control signals called events. Data and
events are always processed as packets, cf. Section 3.2. Event packets are called 1-events or
0-events, depending on their bit value.

3.1 Configurable Objects and Functions

[0736] An RDFP consists of an array of configurable objects and a communication network.
Each object can be configured to perform certain functions (listed below). It performs the same
function repeatedly until the configuration is changed. The array need not be completely
uniform, i.e. not all objects need to be able to perform all functions. E.g., a RAM function can
be implemented by a specialized RAM object which cannot perform any other functions. It is
also possible to combine several objects into a "macro" to realize certain functions. Several RAM
objects can, e.g., be combined to realize a RAM function with larger storage.

[0737] The following functions for processing data and event packets can be configured into an
RDFP. See FIG. 33 for a graphical representation.

[0738] * ALU[opcode]: ALUs perform common arithmetical and logical operations on data. ALU
functions ("opcodes") must be available for all operations used in the HLL. (Footnote 1:
Otherwise, programs containing operations which do not have ALU opcodes in the RDFP must be
excluded from the supported HLL subset or substituted by "macros" of existing functions.) ALU
functions have two data inputs A and B, and one data output X. Comparators have an event
output U instead of the data output. They produce a 1-event if the comparison is true, and a
0-event otherwise.

[0739] * CNT: A counter function which has data inputs LB, UB and INC (lower bound, upper bound
and increment) and data output X (counter value). A packet at event input START starts the
counter, and event input NEXT causes the generation of the next output value (and output
events) or causes the counter to terminate if UB is reached. If NEXT is not connected, the
counter counts continuously. The output events U, V, and W have the following functionality:
For a counter counting N times, N-1 0-events and one 1-event are generated at output U. At
output V, N 0-events are generated, and at output W, N 0-events and one 1-event are created.
The 1-event at W is only created after the counter has terminated, i.e. a NEXT event packet was
received after the last data packet was output.

[0740] * RAM[size]: The RAM function stores a fixed number of data words ("size"). It has a data
input RD and a data output OUT for reading at address RD. Event output ERD signals
completion of the read access. For a write access, data inputs WR and IN (address and value)
and data output OUT are used. Event output EWR signals completion of the write access. ERD
and EWR always generate 0-events. Note that external RAM can be handled as RAM functions
exactly like internal RAM.

[0741] * GATE: A GATE synchronizes a data packet at input A and an event packet at input E.
When both inputs have arrived, they are both consumed. The data packet is copied to output X,
and the event packet to output U.

[0742] * MUX: A MUX function has 2 data inputs A and B, an event input SEL, and a data output X.
If SEL receives a 0-event, input A is copied to output X and input B is discarded. For a 1-event,
B is copied and A is discarded.

[0743] * MERGE: A MERGE function has 2 data inputs A and B, an event input SEL, and a data
output X. If SEL receives a 0-event, input A is copied to output X, but input B is not discarded.
The packet is left at the input B instead. For a 1-event, B is copied and A is left at the input.

[0744] * DEMUX: A DEMUX function has one data input A, an event input SEL, and two data
outputs X and Y. If SEL receives a 0-event, input A is copied to output X, and no packet is
created at output Y. For a 1-event, A is copied to Y, and no packet is created at output X.

[0745] * MDATA: An MDATA function multiplies (replicates) data packets. It has a data input A,
an event input SEL, and a data output X. If SEL receives a 1-event, a data packet at A is
consumed and copied to output X. For all subsequent 0-events at SEL, a copy of the input data
packet is produced at the output without consuming new packets at A. Only if another 1-event
arrives at SEL is the next data packet at A consumed and copied. (Footnote 2: Note that this
can be implemented by a MERGE with special properties on XPP™.)

[0746] * INPORT[name]: Receives data packets from outside the RDFP through input port "name"
and copies them to data output X. If a packet was received, a 0-event is produced at event
output U, too. (Note that this function can only be configured at special objects connected to
external busses.)

[0747] * OUTPORT[name]: Sends data packets received at data input A to the outside of the RDFP
through output port "name". If a packet was sent, a 0-event is produced at event output U, too.
(Note that this function can only be configured at special objects connected to external busses.)
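
The difference between the routing functions described in the list above is easiest to see on packet queues. The following is a minimal behavioural sketch, not the hardware implementation: input ports are modelled as Python deques, each function runs until it would stall, and all function names are chosen for this illustration. The handling of a 0-event arriving at MDATA before any 1-event is an assumption, since the text does not cover that case.

```python
from collections import deque


def mux(a, b, sel):
    """MUX: consume one packet from A, B and SEL; forward A on 0, B on 1."""
    x = deque()
    while a and b and sel:
        pa, pb, s = a.popleft(), b.popleft(), sel.popleft()
        x.append(pb if s else pa)
    return x


def merge(a, b, sel):
    """MERGE: like MUX, but the unselected input keeps its packet."""
    x = deque()
    while sel and (b if sel[0] else a):
        s = sel.popleft()
        x.append(b.popleft() if s else a.popleft())
    return x


def demux(a, sel):
    """DEMUX: route A to output X on a 0-event, to output Y on a 1-event."""
    x, y = deque(), deque()
    while a and sel:
        (y if sel.popleft() else x).append(a.popleft())
    return x, y


def mdata(a, sel):
    """MDATA: a 1-event loads a new packet from A, a 0-event repeats the held one."""
    x, held = deque(), None
    while sel:
        if sel[0]:
            if not a:
                break          # stall until a new data packet arrives
            held = a.popleft()
        elif held is None:
            break              # assumption: a leading 0-event stalls
        sel.popleft()
        x.append(held)
    return x
```

With inputs A = 1, 2 and B = 10, 20 and the select sequence 0, 1, the MUX yields 1, 20 and empties both inputs, whereas the MERGE yields 1, 10 and leaves 2 and 20 waiting at its inputs.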

[0748] Additionally, the following functions manipulate only event packets:

[0749] * 0-FILTER, 1-FILTER: A FILTER has an input E and an output U. A 0-FILTER copies a
0-event from E to U, but 1-events at E are discarded. A 1-FILTER copies 1-events and discards
0-events.

[0750] * INVERTER: Copies all events from input E to output U but inverts their values.

[0751] * 0-CONSTANT, 1-CONSTANT: 0-CONSTANT copies all events from input E to output U,
but changes them all to value 0. 1-CONSTANT changes all to value 1.

[0752] * ECOMB: Combines two or more inputs E1, E2, E3 . . . , producing a packet at output U. The
output is a 1-event if and only if one or more of the input packets are 1-events (logical OR). A
packet must be available at all inputs before an output packet is produced. (Footnote 3: Note
that this function is implemented by the EAND operator on the XPP™.)

[0753] * ESEQ[seq]: An ESEQ generates a sequence "seq" of events, e.g. "0001", at its output U.
If it has an input START, one entire sequence is generated for each event packet arriving at
START. The sequence is only repeated when the next event arrives at START. However, if START
is not connected, ESEQ constantly repeats the sequence.
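
The event functions listed above can be sketched as operations on lists of 0/1 values, one list element per event packet. This is only an illustration under simplifying assumptions: the function names are invented here, timing is ignored, and ESEQ is shown only for the case where START is connected.

```python
def event_filter(keep, events):
    """0-FILTER (keep=0) or 1-FILTER (keep=1): pass matching events, drop the rest."""
    return [e for e in events if e == keep]


def inverter(events):
    """INVERTER: one output event per input event, with the value inverted."""
    return [1 - e for e in events]


def constant(value, events):
    """0-CONSTANT / 1-CONSTANT: one output event per input event, value forced."""
    return [value for _ in events]


def ecomb(*inputs):
    """ECOMB: wait for a packet on every input, then emit their logical OR."""
    return [int(any(group)) for group in zip(*inputs)]


def eseq(seq, start_events):
    """ESEQ with START connected: one full sequence per START event."""
    return [int(ch) for _ in start_events for ch in seq]
```

The 1-FILTER followed by a 0-CONSTANT turns every 1-event into a 0-event and drops everything else, e.g. `constant(0, event_filter(1, [0, 1, 0]))` yields a single 0-event.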

[0754] Note that ALU, MUX, DEMUX, GATE and ECOMB functions behave like their
equivalents in classical dataflow machines [3, 4].

3.2 Packet-Based Communication Network 

[0755] The communication network of an RDFP can connect an output of one object (i.e. its
respective function) to the input(s) of one or several other objects. This is usually achieved by
busses and switches. By placing the functions properly on the objects, many functions can be
connected arbitrarily up to a limit imposed by the device size. As mentioned above, all values
are communicated as packets. Separate communication networks exist for data and for event
packets. The packets synchronize the functions as in a dataflow machine with acknowledge [3].
I.e., a function only executes when all input packets are available (apart from the non-strict
exceptions as described above). The function also stalls if the last output packet has not been
consumed. Therefore a data-flow graph mapped to an RDFP self-synchronizes its execution
without the need for external control. Only if two or more function outputs (data or event) are
connected to the same function input ("N to 1 connection") is the self-synchronization
disabled. (Footnote 4: Note that on XPP™ Cores, an "N to 1 connection" for events is realized
by the EOR function, and for data by just assigning several outputs to an input.) The user has
to ensure that only one packet arrives at a time in a correct CDFG. Otherwise a packet might get
lost, and the value resulting from combining two or more packets is undefined. However, a
function output can be connected to many function inputs ("1 to N connection") without
problems.
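
The firing rule described above (execute only when all inputs are present, stall while the previous result is unconsumed) can be sketched as follows. The `Slot` class and `try_fire` function are hypothetical names for this illustration, and the model assumes one packet of buffering per connection; the text does not specify buffer depths.

```python
class Slot:
    """One-packet channel: a producer may only write when the slot is empty."""

    def __init__(self, value=None):
        self.value = value

    @property
    def full(self):
        return self.value is not None


def try_fire(op, inputs, output):
    """Fire `op` once if every input holds a packet and the output slot is free."""
    if output.full or not all(slot.full for slot in inputs):
        return False               # stall: missing operand or unconsumed result
    args = [slot.value for slot in inputs]
    for slot in inputs:
        slot.value = None          # consume (acknowledge) the input packets
    output.value = op(*args)
    return True


a, b, x = Slot(2), Slot(), Slot()
fired_without_b = try_fire(lambda p, q: p + q, [a, b], x)   # B is still empty
b.value = 3
fired_with_b = try_fire(lambda p, q: p + q, [a, b], x)      # both inputs present
```

Because every function follows the same rule, no central scheduler is needed; a chain of such functions advances whenever packets and free slots allow it.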

[0756] There are some special cases:

[0757] * A function input can be preloaded with a distinct value during configuration. This
packet is consumed like a normal packet coming from another object.

[0758] * A function input can be defined as constant. In this case, the packet at the input is
reproduced repeatedly for each function execution.

[0759] An RDFP requires register delays in the dataflow. Otherwise very long combinational
delays and asynchronous feedback are possible. We assume that delays are inserted at the inputs
of some functions (like for most ALUs) and in some routing segments of the communication
network. Note that registers change the timing, but not the functionality of a correct CDFG.

4 Configuration Generation

4.1 Language Definition

[0760] The following HLL features are not supported by the method described here:

[0761] * pointer operations

[0762] * library calls, operating system calls (including standard I/O functions)

[0763] * recursive function calls (Note that non-recursive function calls can be eliminated by
function inlining and therefore are not considered here.)

[0764] * All scalar data types are converted to type integer. Integer values are equivalent to
data packets in the RDFP. Arrays (possibly multi-dimensional) are the only composite data types
considered.

[0765] The following additional features are supported: 
[0766] INPORTS and OUTPORTS can be accessed by the HLL functions getstream(name,
value) and putstream(name, value) respectively. 

4.2 Mapping of High-Level Language Constructs

[0767] This method converts an HLL program to a CDFG consisting of the RDFP functions
defined in Section 3.1. Before the processing starts, all HLL program arrays are mapped to
RDFP RAM functions. An array x is mapped to RAM RAM(x). If several arrays are mapped to
the same RAM, an offset is assigned, too. The RAMs are added to an initially empty CDFG.
There must be enough RAMs of sufficient size for all program arrays.

[0768] The CDFG is generated by a traversal of the AST of the HLL program. It processes the
program statement by statement and descends into the loops and conditional statements as
appropriate. The following two pieces of information are updated at every program point
during the traversal. (Footnote 5: In a program, program points are between two statements or
before the beginning or after the end of a program component like a loop or a conditional
statement.)

[0769] * START points to an event output of an RDFP function. This output delivers a 0-event
whenever the program execution reaches this program point. At the beginning, a 0-CONSTANT
whose event input is preloaded is added to the CDFG. (It delivers a 0-event
immediately after configuration.) START initially points to its output. This event is used to start
the overall program execution. The START_new signal generated after a program part has
finished executing is used as the new START signal for the following program parts, or it signals
termination of the entire program. The START events guarantee that the execution order of the
original program is maintained wherever the data dependencies alone are not sufficient. This
scheduling scheme is similar to a one-hot controller for digital hardware.

[0770] * VARLIST is a list of {variable, function-output} pairs. The pairs map integer variables
or array elements to a CDFG function's output. The first pair for a variable in VARLIST contains
the output of the function which produces the value of this variable valid at the current program
point. New pairs are always added to the front of VARLIST. The expression VARDEF(var)
refers to the function-output of the first pair with variable var in VARLIST. (Footnote 6: This
method of using a VARLIST is adapted from the Transmogrifier C compiler [5].)
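
A VARLIST as described above is a front-inserted association list, and VARDEF is a first-match lookup. The following is a minimal sketch; the class and method names are chosen for this illustration and do not come from the original compiler.

```python
class VarList:
    """Front-inserted list of (variable, function-output) pairs."""

    def __init__(self, pairs=()):
        self.pairs = list(pairs)

    def define(self, var, output):
        """Record a new definition; new pairs always go to the front."""
        self.pairs.insert(0, (var, output))

    def vardef(self, var):
        """Return the output of the first (i.e. currently valid) pair for `var`."""
        for name, output in self.pairs:
            if name == var:
                return output
        raise KeyError(var)

    def copy(self):
        """Independent copy, e.g. for processing the two branches of an IF."""
        return VarList(self.pairs)


varlist = VarList()
varlist.define("a", 3)
varlist.define("a", 5)   # the redefinition shadows the older pair
```

Older pairs are kept rather than overwritten, which matches the lists shown in the examples of the following subsections.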
[0771] The following subsections systematically list all HLL program components and describe
how they are processed, thereby altering the CDFG, START and VARLIST. 

4.2.1 Integer Expressions and Assignments 

[0772] Straight-line code without array accesses can be directly mapped to a data-flow graph.
One ALU is allocated for each operator in the program. Because of the self-synchronization of
the ALUs, no explicit control or scheduling is needed. Therefore processing these assignments
does not access or alter START. The data dependences (as they would be exposed in the DAG
representation of the program [1]) are analyzed through the processing of VARLIST. These
assignments synchronize themselves through the data-flow. The data-driven execution
automatically exploits the available instruction level parallelism.

[0773] All assignments evaluate the right-hand side (RHS) or source expression. This
evaluation results in a pointer to a CDFG object's output (or pseudo-object as defined below).
For integer assignments, the left-hand side (LHS) variable or destination is combined with the
RHS result object to form a new pair {LHS, result(RHS)} which is added to the front of
VARLIST.

[0774] The simplest statement is a constant assigned to an integer. (Footnote 7: Note that we
use C syntax for the following examples.)

[0775] a=5;

[0776] It doesn't change the CDFG, but adds {a, 5} to the front of VARLIST. The constant 5 is
a "pseudo-object" which only holds the value, but does not refer to a CDFG object. Now
VARDEF(a) equals 5 at subsequent program points before a is redefined.

[0777] Integer assignments can also combine variables already defined and constants: 

[0778] b=a*2+3;

[0779] In the AST, the RHS is already converted to an expression tree. This tree is transformed
to a combination of old and new CDFG objects (which are added to the CDFG) as follows:
Each operator (internal node) of the tree is substituted by an ALU with the opcode
corresponding to the operator in the tree. If a leaf node is a constant, the ALU's input is directly
connected to that constant. If a leaf node is an integer variable var, it is looked up in VARLIST,
i.e. VARDEF(var) is retrieved. Then VARDEF(var) (an output of an already existing object in
the CDFG or a constant) is connected to the ALU's input. The output of the ALU corresponding to
the root operator in the expression tree is defined as the result of the RHS. Finally, a new pair
{LHS, result(RHS)} is added to VARLIST. If the two assignments above are processed, the
CDFG with two ALUs in FIG. 34 is created. (Footnote 8: Note that the input and output
names can be deduced from their position, cf. FIG. 33. Also note that the compiler frontend
would normally have substituted the second assignment by b=13 (constant propagation). For the
simplicity of this explanation, no frontend optimizations are considered in this and the
following examples.) Outputs occurring in VARLIST are labeled by Roman numerals. After these
two assignments, VARLIST=[{b, I}, {a, 5}]. (The front of the list is on the left side.) Note
that all inputs connected to a constant (whether direct from the expression tree or retrieved
from VARLIST) must be defined as constant. Inputs defined as constants have a small c next to
the input arrow in FIG. 34.
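
The expression mapping described above can be sketched in a few lines. This is an illustration only: the tuple encoding of the expression tree, the ALU naming scheme and the function name `compile_assign` are assumptions, VARLIST is modelled as a plain list of pairs with the front at index 0, and a lookup of an undefined variable is not handled.

```python
import itertools


def compile_assign(lhs, rhs, varlist, cdfg, ids):
    """Map `lhs = rhs` to ALUs.

    `rhs` is an int (constant), a str (variable name) or a tuple
    (opcode, left, right). New ALUs are appended to `cdfg` as
    (name, opcode, input_a, input_b).
    """
    def build(node):
        if isinstance(node, int):
            return node                       # constant leaf: pseudo-object
        if isinstance(node, str):             # variable leaf: VARDEF lookup
            return next(out for var, out in varlist if var == node)
        op, left, right = node
        alu = f"ALU{next(ids)}"               # hypothetical naming scheme
        cdfg.append((alu, op, build(left), build(right)))
        return alu

    varlist.insert(0, (lhs, build(rhs)))      # new pair goes to the front


varlist, cdfg, ids = [], [], itertools.count(1)
compile_assign("a", 5, varlist, cdfg, ids)                       # a=5;
compile_assign("b", ("+", ("*", "a", 2), 3), varlist, cdfg, ids)  # b=a*2+3;
```

After the two assignments the sketch holds two ALUs, with the multiplier's A input connected to the constant 5 retrieved from VARLIST, and a VARLIST of the same shape as the one given in the text, [{b, I}, {a, 5}].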

4.2.2 Conditional Integer Assignments 

[0780] For conditional if-then-else statements containing only integer assignments, objects for
condition evaluation are created first. The object event output indicating the condition result is
kept for choosing the correct branch result later. Next, both branches are processed in parallel,
using separate copies VARLIST1 and VARLIST2 of VARLIST. (VARLIST itself is not
changed.) Finally, for all variables added to VARLIST1 or VARLIST2, a new entry for
VARLIST is created (combination phase). The valid definitions from VARLIST1 and
VARLIST2 are combined with a MUX function, and the correct input is selected by the
condition result. For variables only defined in one of the two branches, the multiplexer uses the
result retrieved from the original VARLIST for the other branch. If the original VARLIST does
not have an entry for this variable, a special "undefined" constant value is used. However, in a
functionally correct program this value will never be used. As an optimization, only variables
live [1] after the if-then-else structure need to be added to VARLIST in the combination
phase. (Footnote 9: Definition: A variable is live at a program point if its value is read at a
statement reachable from here without intermediate redefinition.)
[0781] Consider the following example: TABLE-US-00224

i = 7;
a = 3;
if (i < 10) {
    a = 5;
    c = 7;
}
else {
    c = a - 1;
    d = 0;
}

[0782] FIG. 35 shows the resulting CDFG. Before the if-then-else construct, VARLIST=[{a,
3}, {i, 7}]. After processing the branches, for the then branch, VARLIST1=[{c, 7}, {a, 5}, {a,
3}, {i, 7}], and for the else branch, VARLIST2=[{d, 0}, {c, I}, {a, 3}, {i, 7}]. After
combination, VARLIST=[{d, II}, {c, III}, {a, IV}, {a, 3}, {i, 7}].
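
The combination phase for this example can be sketched as follows. Several details are assumptions made for the illustration: lists are plain Python lists of pairs, MUX objects are named after their variable instead of carrying Roman numerals, variables are processed in alphabetical order, and the then-branch value is wired to MUX input B because a true condition produces a 1-event and a 1-event selects B (Section 3.1). The figure itself may assign the inputs differently.

```python
def vardef(varlist, var, default="undefined"):
    """First definition of `var` in `varlist`, or the special undefined constant."""
    return next((out for v, out in varlist if v == var), default)


def combine(varlist, varlist1, varlist2, cond, cdfg):
    """Add one MUX per variable newly defined in either branch copy."""
    base = len(varlist)
    new1 = [v for v, _ in varlist1[:len(varlist1) - base]]
    new2 = [v for v, _ in varlist2[:len(varlist2) - base]]
    result = list(varlist)
    for var in sorted(set(new1) | set(new2)):
        mux = f"MUX_{var}"
        # (name, input A = else value, input B = then value, SEL)
        cdfg.append((mux, vardef(varlist2, var), vardef(varlist1, var), cond))
        result.insert(0, (var, mux))
    return result


varlist = [("a", 3), ("i", 7)]
varlist1 = [("c", 7), ("a", 5), ("a", 3), ("i", 7)]     # then branch
varlist2 = [("d", 0), ("c", "I"), ("a", 3), ("i", 7)]   # else branch
cdfg = []
combined = combine(varlist, varlist1, varlist2, "cond", cdfg)
```

For d, which is only assigned in the else branch and has no earlier definition, the then-side MUX input receives the "undefined" constant, as described in the text.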

[0783] Note that case or switch statements can be processed, too, since they can, without loss
of generality, be converted to nested if-then-else statements.

[0784] Processing conditional statements this way does not require explicit control and does not
change START. Both branches are executed in parallel and synchronized by the data-flow. It is
possible to pipeline the dataflow for optimal throughput.

4.2.3 General Conditional Statements

[0785] Conditional statements containing either array accesses (cf. Section 4.2.7 below) or inner
loops cannot be processed as described in Section 4.2.2. Data packets must only be sent to the
active branch. This is achieved by the implementation shown in FIG. 40, similar to the method
presented in [4].

[0786] A dataflow analysis is performed to compute the used sets use and defined sets def [1] of
both branches. (Footnote 10: A variable is used in a statement (and hence in a program region
containing this statement) if its value is read. A variable is defined in a statement (or region)
if a new value is assigned to it.) For the current VARLIST entries of all variables in
IN = use(thenbody) ∪ def(thenbody) ∪ use(elsebody) ∪ def(elsebody) ∪ use(header),
DEMUX functions controlled by the IF condition are inserted. Note that arrows with
double lines in FIG. 40 denote connections for all variables in IN, and the shaded DEMUX
function stands for several DEMUX functions, one for each variable in IN. The DEMUX
functions forward data packets only to the selected branch. New lists VARLIST1 and
VARLIST2 are compiled with the respective outputs of these DEMUX functions. The then
branch is processed with VARLIST1, and the else branch with VARLIST2. Finally, the output
values are combined. OUT contains the new values for the same variables as in IN. Since only
one branch is ever activated, there will not be a conflict due to two packets arriving
simultaneously. The combinations will be added to VARLIST after the conditional statement.
If the IF execution is to be pipelined, MERGE opcodes for the output must be inserted, too.
They are controlled by the condition like the DEMUX functions.
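
The set IN can be computed directly from the use and def sets. The following sketch assumes a simplified program representation, invented for this illustration, in which a body is a list of (target, operands) assignments; following the footnote's definition, a variable counts as used if it is read anywhere in the region.

```python
def use_def(statements):
    """Return (use, def) for a list of (target, operands) assignments."""
    used = {v for _, operands in statements for v in operands}
    defined = {target for target, _ in statements}
    return used, defined


def demux_inputs(header_operands, thenbody, elsebody):
    """IN = use(thenbody) | def(thenbody) | use(elsebody) | def(elsebody) | use(header)."""
    use_t, def_t = use_def(thenbody)
    use_e, def_e = use_def(elsebody)
    return use_t | def_t | use_e | def_e | set(header_operands)


# if (i < 10) { a = 5; c = 7; } else { c = a - 1; d = 0; }
in_set = demux_inputs(
    ["i"],
    [("a", []), ("c", [])],
    [("c", ["a"]), ("d", [])],
)
```

One DEMUX function would then be inserted for each element of the resulting set, here a, c, d and i. The example reuses the statement from Section 4.2.2 only to show the set computation; that statement itself does not need the general scheme.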

[0787] The following extension with respect to [4] is added (dotted lines in FIG. 40) in order to
control the execution as mentioned above with START events: The START input is
ECOMB-combined with the condition output and connected to the SEL input of the DEMUX functions.
The START inputs of thenbody and elsebody are generated from the ECOMB output sent
through a 1-FILTER and a 0-CONSTANT (Footnote 11: The 0-CONSTANT is required since START
events must always be 0-events.) or through a 0-FILTER, respectively. The
overall START_new output is generated by a simple "2 to 1 connection" of thenbody's and
elsebody's START_new outputs. With this extension, arbitrarily nested conditional
statements or loops can be handled within thenbody and elsebody.
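
The generation of the two branch START signals can be traced on event lists. This is a behavioural sketch with an invented function name; each list element stands for one execution of the IF statement.

```python
def branch_starts(start_events, cond_events):
    """Return (sel, then_start, else_start) for a general IF statement."""
    # ECOMB: wait for START and the condition result, output their logical OR.
    sel = [s | c for s, c in zip(start_events, cond_events)]
    # 1-FILTER followed by 0-CONSTANT: a 0-event for each true condition.
    then_start = [0 for e in sel if e == 1]
    # 0-FILTER: the 0-event passes unchanged for each false condition.
    else_start = [e for e in sel if e == 0]
    return sel, then_start, else_start


sel, then_start, else_start = branch_starts([0, 0, 0], [1, 0, 1])
```

Because START events are always 0-events, the ECOMB output equals the condition result, but it is only produced once the START event has arrived. That is how the control flow gates the DEMUX functions.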

4.2.4 WHILE Loops

[0788] WHILE loops are processed similarly to the scheme presented in [4], cf. FIG. 41. As in
Section 4.2.3, double line connections and shaded MERGE and DEMUX functions represent
duplication for all variables in IN. Here
IN = use(whilebody) ∪ def(whilebody) ∪ use(header). The WHILE loop executes as
follows: In the first loop iteration, the MERGE functions select all input values from VARLIST
at loop entry (SEL=0). The MERGE outputs are connected to the header and the DEMUX
functions. If the while condition is true (SEL=1), the input values are forwarded to the
whilebody, otherwise to OUT. The output values of the whilebody are fed back to whilebody's
input via the MERGE and DEMUX operators as long as the condition is true. Finally, after the
last iteration, they are forwarded to OUT. The outputs are added to the new VARLIST.
(Footnote 12: Note that the MERGE function for variables not live at the loop's beginning and
the whilebody's beginning can be removed since its output is not used. For these variables, only
the DEMUX function to output the final value is required. Also note that the MERGE functions can
be replaced by simple "2 to 1 connections" if the configuration process guarantees that packets
from IN1 always arrive at the DEMUX's input before feedback values arrive.)

[0789] Two extensions with respect to [4] are added (dotted lines in FIG. 41):

[0790] * In [4], the SEL input of the MERGE functions is preloaded with 0. Hence the loop
execution begins immediately and can be executed only once. Instead, we connect the START input
to the MERGE's SEL input ("2 to 1 connection" with the header output). This makes it possible to
control when the loop execution starts and to restart it.

[0791] * The whilebody's START input is connected to the header output, sent through a
1-FILTER/0-CONSTANT combination as above (generates a 0-event for each loop iteration). By
ECOMB-combining whilebody's START_new output with the header output for the MERGE
functions' SEL inputs, the next loop iteration is only started after the previous one has finished.
The while loop's START_new output is generated by filtering the header output for a 0-event.

[0792] With these extensions, arbitrarily nested conditional statements or loops can be handled
within whilebody.
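
The WHILE scheme above can be followed step by step in a sequential model. The sketch below is a simplification under stated assumptions: all variables in IN travel together as one dictionary instead of one MERGE/DEMUX pair per variable, the header and the body are ordinary Python callables, and timing and pipelining are ignored. The names `run_while`, `header` and `body` are invented for this illustration.

```python
def run_while(inputs, header, body):
    """Execute one WHILE loop in the MERGE/DEMUX scheme.

    Returns the values forwarded to OUT and the number of 0-events
    delivered to whilebody's START input.
    """
    sel_merge = 0          # START event: MERGE takes the loop-entry values
    feedback = None
    starts = 0
    while True:
        values = inputs if sel_merge == 0 else feedback    # MERGE
        cond = 1 if header(values) else 0                  # header output
        if cond == 0:
            # DEMUX (SEL=0) forwards to OUT; the 0-FILTER on the header
            # output yields the loop's START_new event.
            return values, starts
        starts += 1        # 1-FILTER/0-CONSTANT: one START per iteration
        feedback = body(dict(values))                      # DEMUX (SEL=1)
        sel_merge = cond   # a 1-event selects the feedback input next


def body(v):
    # while (i < n) { s = s + i; i = i + 1; }
    v["s"] += v["i"]
    v["i"] += 1
    return v


out, starts = run_while({"i": 0, "n": 4, "s": 0}, lambda v: v["i"] < v["n"], body)
```

The first pass through the MERGE is selected by the START event, every later pass by the header's 1-event, and the final 0-event from the header both routes the values to OUT and signals completion.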

4.2.5 FOR Loops 

[0793] FOR loops are particularly regular WHILE loops. Therefore we could handle them as
explained above. However, our RDFP features the special counter function CNT and the data
packet multiplication function MDATA which can be used for a more efficient implementation
of FOR loops. This new FOR loop scheme is shown in FIG. 42.

[0794] A FOR loop is controlled by a counter CNT. The lower bound (LB), upper bound (UB),
and increment (INC) expressions are evaluated like any other expressions (see Sections 4.2.1
and 4.2.7) and connected to the respective inputs.
[0795] As opposed to WHILE loops, a MERGE/DEMUX combination is only required for
variables in IN1=def(forbody), i.e. those defined in forbody. (Footnote 13: Note that the MERGE
functions can be replaced by simple "2 to 1 connections" as for WHILE loops if the configuration
process guarantees that packets from IN1 always arrive at the DEMUX's input before feedback
values arrive.) IN1 does not contain variables which are only used in forbody, LB, UB, or INC,
and also does not contain the loop index variable. Variables in IN1 are processed as in WHILE
loops, but the MERGE and DEMUX functions' SEL input is connected to CNT's W output. (The W
output does the inverse of a WHILE loop's header output; it outputs a 1-event after the counter
has terminated. Therefore the inputs of the MERGE functions and the outputs of the DEMUX
functions are swapped here, and the MERGE functions' SEL inputs are preloaded with 1-events.)

[0796] CNT's X output provides the current value of the loop index variable. If the final index
value is required (live) after the FOR loop, it is selected with a DEMUX function controlled by
CNT's U event output (which produces one event for every loop iteration).

[0797] Variables in IN2=use(forbody)\def(forbody), i.e. those defined outside the loop and only
used (but not redefined) inside the loop, are handled differently. Unless it is a constant value, the
variable's input value (from VARLIST) must be reproduced in each loop iteration since it is
consumed in each iteration. Otherwise the loop would stall from the second iteration onwards.
The packets are reproduced by MDATA functions, with the SEL inputs connected to CNT's U
output. The SEL inputs must be preloaded with a 1-event to select the first input. The 1-event
provided by the last iteration selects a new value for the next execution of the entire loop.

[0798] The following control events (dotted lines in FIG. 42) are similar to the WHILE loop
extensions, but simpler. CNT's START input is connected to the loop's overall START signal.
START_new is generated from CNT's W output, sent through a 1-FILTER and
0-CONSTANT. CNT's V output produces one 0-event for each loop iteration and is therefore
used as forbody's START. Finally, CNT's NEXT input is connected to forbody's
START_new output.
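
The event streams that CNT contributes to this scheme follow directly from its definition in Section 3.1. The sketch below lists them for one complete counter run; it assumes an inclusive upper bound and a positive increment, neither of which is stated explicitly in the text, and it ignores the handshake with NEXT.

```python
def cnt(lb, ub, inc):
    """Return the X, U, V and W packet streams of one complete counter run.

    Assumptions: UB is inclusive and INC > 0.
    """
    x = list(range(lb, ub + 1, inc))        # loop index values
    n = len(x)
    u = [0] * (n - 1) + [1] if n else []    # the 1-event marks the last value
    v = [0] * n                             # one 0-event per value: forbody's START
    w = [0] * n + [1]                       # trailing 1-event: counter terminated
    return x, u, v, w


x, u, v, w = cnt(0, 3, 1)
# START_new: CNT's W output sent through a 1-FILTER and a 0-CONSTANT.
start_new = [0 for e in w if e == 1]
```

For four iterations, W carries four 0-events followed by one 1-event, so exactly one START_new event is produced per execution of the whole loop.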



[0799] For pipelined loops (as defined below in Section 4.2.6), loop iterations are allowed to
overlap. Therefore CNT's NEXT input need not be connected. Now the counter produces index
variable values and control events as fast as they can be consumed. However, in this case CNT's
W output is not sufficient as overall START_new output since the counter terminates before
the last iteration's forbody finishes. Instead, START_new is generated from CNT's U output
ECOMB-combined with forbody's START_new output, sent through a 1-FILTER/0-CONSTANT
combination. The ECOMB produces an event after termination of each loop
iteration, but only the last event is a 1-event because only the last output of CNT's U output is a
1-event. Hence this event indicates that the last iteration has finished. Cf. Section 4.3 for a FOR
loop example compilation with and without pipelining.
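
The pipelined START_new generation can be checked on event lists. In this sketch, with names invented for the illustration, `cnt_u` is CNT's U stream and `body_done` holds the START_new events that forbody has produced so far.

```python
def pipelined_start_new(cnt_u, body_done):
    """START_new for a pipelined FOR loop."""
    # ECOMB: pair each U event with the completion event of the same iteration.
    combined = [u | d for u, d in zip(cnt_u, body_done)]
    # 1-FILTER followed by 0-CONSTANT: only the last iteration yields an event.
    return [0 for e in combined if e == 1]


still_running = pipelined_start_new([0, 0, 0, 1], [0, 0, 0])   # 3 of 4 iterations done
finished = pipelined_start_new([0, 0, 0, 1], [0, 0, 0, 0])     # all 4 iterations done
```

Although the counter has already emitted all of its U events in both cases, START_new only appears once the fourth forbody execution has reported completion.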

[0800] As for WHILE loops, these methods make it possible to process arbitrarily nested loops
and conditional statements. The following advantages over WHILE loop implementations are
achieved:

[0801] * One index variable value is generated by the CNT function each clock cycle. This is
faster and smaller than the WHILE loop implementation, which allocates a MERGE/DEMUX/ADD
loop and a comparator for the counter functionality.

[0802] * Variables in IN2 (only used in forbody) are reproduced in the special MDATA functions
and need not go through a MERGE/DEMUX loop. This is again faster and smaller than the WHILE
loop implementation.

4.2.6 Vectorization and Pipelining

[0803] The method described so far generates CDFGs performing the HLL program's
functionality on an RDFP. However, the program execution is unduly sequentialized by the
START signals. In some cases, innermost loops can be vectorized. This means that loop
iterations can overlap, leading to a pipelined dataflow through the operators of the loop body.
The Pipeline Vectorization technique [6] can be easily applied to the compilation method
presented here. As mentioned above, for FOR loops, the CNT's NEXT input is removed so that
CNT counts continuously, thereby overlapping the loop iterations.

[0804] All loops without array accesses can be pipelined since the dataflow automatically
synchronizes loop-carried dependences, i.e. dependences between a statement in one iteration
and another statement in a subsequent iteration. Loops with array accesses can be pipelined if
the array (i.e. RAM) accesses do not cause loop-carried dependences or can be transformed to
such a form. In this case no RAM address is written in one iteration and read in a subsequent
iteration. Therefore the read and write accesses to the same RAM may overlap. This degree of
freedom is exploited in the RAM access technique described below. Especially for dual-ported
RAM it leads to considerable performance improvements.
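
The pipelining condition for RAM accesses can be sketched as a brute-force check. The following Python fragment is an illustration only, not the analysis a compiler would actually perform. It assumes that the addresses touched by each iteration can be enumerated, and the callbacks write_addrs and read_addrs are hypothetical names. It tests exactly the condition stated above, namely that no address written in one iteration is read in a subsequent one.

```python
def has_loop_carried_ram_dependence(iterations, write_addrs, read_addrs):
    """Return True if an address written in one iteration is read in a later one."""
    written = set()
    for i in iterations:
        # Reads of this iteration are compared with writes of earlier iterations only.
        if any(addr in written for addr in read_addrs(i)):
            return True
        written.update(write_addrs(i))
    return False


# a[i] = a[i] + 1: every iteration touches only its own address.
independent = has_loop_carried_ram_dependence(
    range(11), lambda i: [i], lambda i: [i])

# a[i + 1] = a[i] + 1: iteration i writes the address read by iteration i + 1.
dependent = has_loop_carried_ram_dependence(
    range(11), lambda i: [i + 1], lambda i: [i])

print(independent, dependent)  # False True
```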

4.2.7 Array Accesses

[0805] In contrast to scalar variables, array accesses have to be controlled explicitly in order to
maintain the program's correct execution order. As opposed to normal dataflow machine models
[3], an RDFP does not have a single address space. Instead, the arrays are allocated to several
RAMs. This leads to a different approach to handling RAM accesses and opens up new
opportunities for optimization.

[0806] To reduce the complexity of the compilation process, array accesses are processed in
two phases. Phase 1 uses "pseudo-functions" for RAM read and write accesses. A RAM read
function has a RD data input (read address) and an OUT data output (read value), and a RAM
write function has WR and IN data inputs (write address and write value). Both functions are
labeled with the array the access refers to, and both have a START event input and a U event
output. The events control the access order. In Phase 2 all accesses to the same RAM are
combined and substituted by a single RAM function as shown in FIG. 43. This involves
manipulating the data and event inputs and outputs such that the correct execution order is
maintained and the outputs are forwarded to the correct part of the CDFG.

[0807] Phase 1: Since arrays are allocated to several RAMs, only accesses to the same RAM
have to be synchronized. Accesses to different RAMs can occur concurrently or even out of
order. In case of data dependencies, the accesses self-synchronize automatically. Within
pipelined loops, not even read and write accesses to the same RAM have to be synchronized.
This is achieved by maintaining separate START signals for every RAM or even separate
START signals for RAM read and RAM write accesses in pipelined loops. At the end of a basic
block [1].sup.14, all START.sub.new outputs must be combined by an ECOMB to provide a
START signal for the next basic block which guarantees that all array accesses in the previous
basic block are completed. For pipelined loops, this condition can even be relaxed: all accesses
have to be completed only after the loop exit. The individual loop iterations need not be
synchronized.

.sup.14 A basic block is a program part with a single entry and a single exit point,
i.e. a piece of straight-line code.

[0808] First the RAM addresses are computed. The compiler frontend's standard transformation
for array accesses can be used, and a CDFG function's output is generated which provides the
address. If applicable, the offset with respect to the RDFP RAM (as determined in the initial
mapping phase) must be added. This output is connected to the pseudo RAM read's RD input
(for a read access) or to the pseudo RAM write's WR input (for a write access). Additionally,
the OUT output (read) or IN input (write) is connected. The START input is connected to the
variable's START signal, and the U output is used as START.sub.new for the next access.

[0809] To avoid redundant read accesses, RAM reads are also registered in VARLIST. Instead
of an integer variable, an array element is used as first element of the pair. However, a change
in a variable occurring in an array index invalidates the information in VARLIST. The entry
must then be removed from VARLIST.
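
The bookkeeping for array reads can be sketched as follows. This is a minimal Python model under stated assumptions, not the compiler's actual data structure: the class VarList, its method names, and the use of a dictionary keyed by (array, index expression) are hypothetical. Index expressions are kept as strings, and the variables they use are found with a simple identifier pattern.

```python
import re


class VarList:
    """Hypothetical model of the VARLIST entries kept for RAM reads."""

    def __init__(self):
        # Maps (array name, index expression) to the CDFG output holding the value.
        self._reads = {}

    def register_read(self, array, index_expr, output):
        self._reads[(array, index_expr)] = output

    def lookup(self, array, index_expr):
        """Return the registered output, or None if a RAM read is required."""
        return self._reads.get((array, index_expr))

    def variable_changed(self, name):
        """Remove every entry whose index expression uses the changed variable."""
        self._reads = {
            key: out for key, out in self._reads.items()
            if name not in re.findall(r"[A-Za-z_]\w*", key[1])
        }


varlist = VarList()
varlist.register_read("a", "i", "RAM_a.OUT")
print(varlist.lookup("a", "i"))  # RAM_a.OUT, so a second read of a[i] is redundant
varlist.variable_changed("i")    # e.g. after i = i + 1
print(varlist.lookup("a", "i"))  # None, so a[i] must be read again
```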

[0810] The following example with two read accesses compiles to the intermediate CDFG
shown in FIG. 44. The START signals refer only to variable a. STOP1 is the event connection
which synchronizes the accesses. Inputs START(old), i and j should be substituted by the
actual outputs resulting from the program before the array reads.

[0811] x=a[i];
[0812] y=a[j];
[0813] z=x+y;

[0814] FIG. 45 shows the translation of the following write access:
[0815] a[i]=x;

[0816] Phase 2: We now merge the pseudo-functions of all accesses to the same RAM and
substitute them by a single RAM function. For all data inputs (RD for read access and WR and
IN for write access), GATEs are inserted between the input and the RAM function. Their E
inputs are connected to the respective START inputs of the original pseudo-functions. If a RAM
is read and written at only one program point, the U output of the read and write access is
moved to the ERD or EWR output, respectively. For example, the single access a[i]=x; from
FIG. 45 is transformed to the final CDFG shown in FIG. 37.

[0817] However, if several read or several write accesses (i.e. pseudo-functions from different
program points) to the same RAM occur, the ERD or EWR events are not specific anymore.
But a START.sub.new event of the original pseudo-function should only be generated for the
respective program point, i.e. for the current access. This is achieved by connecting the START
signals of all other accesses (pseudo-functions) of the same type (read or write) with the
inverted START signal of the current access. The resulting signal produces an event for every
access, but a 1-event only for the current access. This event is ECOMB-combined with the
RAM's ERD or EWR output. The ECOMB's output will only occur after the access is
completed. Because ECOMB OR-combines its event packets, only the current access produces
a 1-event. Next, this event is filtered with a 1-FILTER and changed by a 0-CONSTANT,
resulting in a START.sub.new signal which produces a 0-event only after the current access is
completed, as required.

[0818] For several accesses, several sources are connected to the RD, WR and IN inputs of a
RAM. This disables the self-synchronization. However, since only one access occurs at a time,
the GATEs only allow one data packet to arrive at the inputs.

[0819] For read accesses, the packets at the OUT output face the same problem as the ERD
event packets: They occur for every read access, but must only be used (and forwarded to
subsequent operators) for the current access. This can be achieved by connecting the OUT
output via a DEMUX function. The Y output of the DEMUX is used, and the X output is left
unconnected. Then it acts as a selective gate which only forwards packets if its SEL input
receives a 1-event, and discards its data input if SEL receives a 0-event. The signal created by
the ECOMB described above for the START.sub.new signal creates a 1-event for the current
access, and a 0-event otherwise. Using it as the SEL input achieves exactly the desired
functionality.
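
The selection mechanism for several read accesses to one RAM can be illustrated by the following Python sketch. It is a simplified model under stated assumptions: events are the integers 0 and 1, the RAM's ERD events are assumed to be 0-events, and the function names and the example data are hypothetical.

```python
def select_events(access_order, current):
    """N-to-1 connection of the inverted START of `current` with the START of all others.

    A START is a 0-event, so inverting it yields a 1-event for the current access.
    """
    return [1 if access == current else 0 for access in access_order]


def ecomb(a, b):
    """ECOMB: join two event streams pairwise and OR-combine the events."""
    return [x | y for x, y in zip(a, b)]


def demux_y(data, sel):
    """DEMUX with only Y connected: forward a packet if SEL is 1, discard it if SEL is 0."""
    return [d for d, s in zip(data, sel) if s == 1]


access_order = ["read_x", "read_y", "read_x"]  # program points reading the same RAM
ram_out = [10, 20, 30]                         # one OUT packet per read access
ram_erd = [0, 0, 0]                            # one ERD event per read (assumed 0-events)

sel_x = ecomb(select_events(access_order, "read_x"), ram_erd)
sel_y = ecomb(select_events(access_order, "read_y"), ram_erd)
print(demux_y(ram_out, sel_x))  # [10, 30]
print(demux_y(ram_out, sel_y))  # [20]
```

Each program point thus sees only the OUT packets of its own accesses, although the RAM produces a packet for every read.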

[0820] FIG. 36 shows the resulting CDFG for the first example above (two read accesses), after
applying the transformations of Phase 2 to FIG. 44. STOP1 is now generated as follows:
START(old) is inverted, "2 to 1 connected" to STOP1 (because it is the START input of the
second read pseudo-function), ECOMB-combined with RAM's ERD output and sent through
the 1-FILTER/0-CONSTANT combination. START(new) is generated similarly, but here
START(old) is directly used and STOP1 inverted. The GATEs for input IN (i and j) are
connected to START(old) and STOP1, respectively, and the DEMUX functions for outputs x
and y are connected to the ECOMB outputs related to STOP1 and START(new).



[0821] Multiple write accesses use the same control events, but instead of one GATE per access
for the RD inputs, one GATE for WR and one GATE for IN (with the same E input) are used. The
EWR output is processed like the ERD output for read accesses.

[0822] This transformation ensures that all RAM accesses are executed correctly, but it is not
very fast since read or write accesses to the same RAM are not pipelined. The next access only
starts after the previous one is completed, even if the RAM being used has several pipeline
stages. This inefficiency can be removed as follows:

[0823] First, continuous sequences of either read accesses or write accesses (not mixed) within a
basic block are detected by checking for pseudo-functions whose U output is directly connected
to the START input of another pseudo-function of the same RAM and the same type (read or
write). For these sequences it is possible to stream data into the RAM rather than waiting for the
previous access to complete. For this purpose, a combination of MERGE functions selects the
RD or WR and IN inputs in the order given by the sequence. The MERGEs must be controlled
by iterative ESEQs guaranteeing that the inputs are only forwarded in the desired order. Then
only the first access in the sequence needs to be controlled by a GATE or GATEs. Similarly, the
OUT outputs of a read access can be distributed more efficiently for a sequence. A combination
of DEMUX functions with the same ESEQ control can be used. It is most efficient to arrange
the MERGE and DEMUX functions as balanced binary trees.

[0824] The START.sub.new signal is generated as follows: For a sequence of length n, the
START signal of the entire sequence is replicated n times by an ESEQ[00..1] function with the
START input connected to the sequence's START. Its output is directly "N to 1 connected"
with the other accesses' START signal (for single accesses) or ESEQ outputs sent through
0-CONSTANT (for access sequences), ECOMB-connected to EWR or ERD, respectively, and
sent through a 1-FILTER/0-CONSTANT combination, similar to the basic method described
above. Since only the last ESEQ output is a 1-event, only the last RAM access generates a
START.sub.new as required. Alternatively, for read accesses, the generation of the last output
can be sent through a GATE (without the E input connected), thereby producing a
START.sub.new event.

[0825] FIG. 46 shows the optimized version of the first example (FIGS. 44 and 36) using the
ESEQ-method for generating START.sub.new, and FIG. 38 shows the final CDFG of the
following, larger example with three array reads. Here the latter method for producing the
START.sub.new event is used.

[0826] x=a[i];
[0827] y=a[j];
[0828] z=a[k];

[0829] If several read sequences or read sequences and single read accesses occur for the same
RAM, 1-events for detecting the current accesses must be generated for sequences of read
accesses. They are needed to separate the OUT values relating to separate sequences. The
ESEQ output just defined, sent through a 1-CONSTANT, achieves this. It is again "N to 1
connected" to the other accesses' START signals (for single accesses) or ESEQ outputs sent
through 0-CONSTANT (for access sequences). The resulting event is used to control a
first-stage DEMUX which is inserted to select the relevant OUT output data packets of the
sequence as described above for the basic method. Refer to the second example (FIGS. 47 and 48)
in Section 4.3 for a complete example.

4.2.8 Input and Output Ports 

[0830] Input and output ports are processed similarly to vector accesses. A read from an input
port is like an array read without an address. The input data packet is sent to DEMUX functions
which send it to the correct subsequent operators. The STOP signal is generated in the same
way as described above for RAM accesses by combining the INPORT's U output with the
current and other START signals.

[0831] Output ports control the data packets by GATEs like array write accesses. The STOP
signal is also created as for RAM accesses.

4.3 More Examples 

[0832] FIG. 39 shows the generated CDFG for the following for loop. TABLE US 00225

a = b + c;
for (i=0; i<=10; i++) {
    a = a + i;
    x[i] = k;
}

[0833] In this example, IN1={a} and IN2={k} (cf. FIG. 42). The MERGE function for variable
a is replaced by a 2:1 data connection as mentioned in the footnote of Section 4.2.5. Note that
only one data packet arrives for variables b, c and k, and one final packet is produced for a (out).
forbody does not use a START event since both operations (the adder and the RAM write) are
dataflow-controlled by the counter anyway. But the RAM's EWR output is the forbody's
START.sub.new and is connected to CNT's NEXT input. Note that the pipelining optimization,
cf. Section 4.2.6, was not applied here. If it is applied (which is possible for this loop), CNT's
NEXT input is not connected, cf. FIG. 43. Here, the loop iterations overlap. START.sub.new is
generated from CNT's U output and forbody's START.sub.new (i.e. RAM's EWR output), as
defined at the end of Section 4.2.5.

[0834] The following program contains a vectorizable (pipelined) loop with one write access to
array (RAM) x and a sequence of two read accesses to array (RAM) y. After the loop, another
single read access to y occurs. TABLE US 00226

z = 0;
for (i=0; i<=10; i++) {
    x[i] = i;
    z = z + y[i] + y[2*i];
}
a = y[k];

[0835] FIG. 47 shows the intermediate CDFG generated before the array access Phase 2
transformation is applied. The pipelined loop is controlled as follows: Within the loop, separate
START signals for write accesses to x and read accesses to y are used. The reentry to the
forbody is also controlled by two independent signals ("cycle1" and "cycle2"). For the read
accesses, "cycle2" guarantees that the read y accesses occur in the correct order. But the
beginning of an iteration for read y and write x accesses is not synchronized. Only at loop exit
must all accesses be finished, which is guaranteed by the signal "loop finished". The single read
access is completely independent of the loop.

[0836] FIG. 48 shows the final CDFG after Phase 2. Note that "cycle1" is removed since a
single write access needs no additional control, and "cycle2" is removed since the inserted
MERGE and DEMUX functions automatically guarantee the correct execution order. The read
y accesses are not independent anymore since they all refer to the same RAM, and the functions
have been merged. ESEQs have been allocated to control the MERGE and DEMUX functions
of the read sequence, and for the first-stage DEMUX functions which separate the read OUT
values for the read sequence and for the final single read access. The ECOMBs, 1-FILTERs,
0-CONSTANTs and 1-CONSTANTs are allocated as described in Section 4.2.7, Phase 2, to
generate correct control events for the GATEs and DEMUX functions.

[0843] In connection with the coupling of an array and a processor, the following is noted as
well:

[0844] In coupling the XPP or any other data processing array having a number of preferably
coarse grain cells to a conventional (that is, sequential and/or von Neumann)
processor/microcontroller design, a number of op-code instructions may be added to the
instruction set of the conventional processor. A non-limiting example is given below, and it will
be obvious to the average skilled person that it is not intended to limit the invention but to
disclose certain aspects thereof in more detail, the aspects being of more or less importance. For
example, it may be the case that bit lengths other than those indicated for instructions are used.
It is also to be understood that the mnemonics might be changed and that in certain cases
additional instructions and/or operations might be useful, whereas in other cases only a subset
of the instructions indicated below might be useful. For example, it is easily possible to
combine one or more XPPs or any other reconfigurable device or set or group of identical or
different devices, in particular runtime reconfigurable and/or coarse grain devices, FPGAs and/or
data streaming processors, with any design other than the LEON processor and/or a processor
using SPARC instructions. Also, the use of the instruction set is not limited to certain compiling
algorithms, although the compiling techniques disclosed in other parts of the present invention
are very useful. It is to be noted that one preferred way of using the XPP or other reconfigurable
device or set or group of identical or different devices, in particular runtime reconfigurable
and/or coarse grain devices, FPGAs and/or data streaming processors, coupled to a design such as
the LEON processor and/or other conventional processor is the use of macro libraries so that
predefined configurations can be instantiated and/or called as subroutines. These libraries may
be automatically compiled and/or the configurations corresponding thereto may be set up by
hand. This being noted, with respect to additional op-code instructions the following is noted:

[0845] All additional instructions refer to format 3 of the SPARC instruction set, the op index
being 3. The SPARC specification uses this format for the declaration of memory accesses. As
in the original instruction set a plurality of op-codes had not been implemented, there was an
opportunity to use the free fields for dedicated purposes.

[0846] Also, it was possible to ensure completeness of instructions; for example, no memory
access instruction is located in between arithmetic instructions.

[0847] Overview of the SPARC instruction format 3 TABLE US 00227

op   rd   op3   rs1   i=0   asi      rs2
op   rd   op3   rs1   i=1   simm13
op   rd   op3   rs1   opf            rs2
31   29   24    18    13    12       4   0

[0848] Here, the abbreviations have the following meaning:

[0849] rd: This field is five bits long. It contains the address of the source or target register
for arithmetic and for load/store operations.

[0850] op3: This field is six bits long. Together with the op field it builds the instructions.

[0851] rs1: This field is five bits long. It contains the first operand of an ALU operation.

[0852] opf: This field is nine bits long and contains the instructions of a floating point operation.

[0853] i: This is a one-bit field selecting the second operand for arithmetic or load/store
operations, respectively. If i=1, the operand is the content of simm13; otherwise the operand is
the content of rs2.

[0854] asi: This field is eight bits long and indicates the address space which is accessed by
load/store operations.

[0855] simm13: This field is thirteen bits long and contains the second operand of an arithmetic
and/or load/store operation, the operand having a sign (+, -).

[0856] rs2: This field is five bits long and corresponds to the operand of an arithmetic and/or
load/store operation, respectively. It does not have a sign (+, -).
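
The field layout can be made concrete with a short Python sketch that packs and unpacks a 32-bit format 3 word. The bit boundaries follow the standard SPARC V8 layout (op in bits 31-30, rd in 29-25, op3 in 24-19, rs1 in 18-14, i in bit 13, asi in 12-5, rs2 in 4-0, simm13 in 12-0). The function names encode_format3 and decode_format3 are hypothetical, and the field values in the usage lines are chosen for illustration only.

```python
def encode_format3(op, rd, op3, rs1, i, rs2=0, asi=0, simm13=0):
    """Pack the fields of a format 3 instruction (register or immediate form)."""
    word = ((op & 0x3) << 30) | ((rd & 0x1F) << 25) | ((op3 & 0x3F) << 19) | ((rs1 & 0x1F) << 14)
    if i:
        # Immediate form: bit 13 is set, bits 12-0 hold the signed simm13.
        return word | (1 << 13) | (simm13 & 0x1FFF)
    # Register form: bits 12-5 hold asi, bits 4-0 hold rs2.
    return word | ((asi & 0xFF) << 5) | (rs2 & 0x1F)


def decode_format3(word):
    """Unpack a format 3 word into a dictionary of its fields."""
    fields = {
        "op": (word >> 30) & 0x3,
        "rd": (word >> 25) & 0x1F,
        "op3": (word >> 19) & 0x3F,
        "rs1": (word >> 14) & 0x1F,
        "i": (word >> 13) & 0x1,
    }
    if fields["i"]:
        simm13 = word & 0x1FFF
        # Sign-extend the 13-bit immediate.
        fields["simm13"] = simm13 - 0x2000 if simm13 & 0x1000 else simm13
    else:
        fields["asi"] = (word >> 5) & 0xFF
        fields["rs2"] = word & 0x1F
    return fields


word = encode_format3(op=3, rd=1, op3=0b100011, rs1=2, i=1, simm13=-4)
print(hex(word))                       # 0xc318bffc
print(decode_format3(word)["simm13"])  # -4
```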

[0857] Overview of Additional Instructions TABLE US 00228

Opcode        Meaning                                                                                      Privileged
stxppd        Write word from an XPP data register into memory                                             no
ldxppd        Load word from memory into an XPP data register                                              no
stxppe        Write word from an XPP event register into memory                                            no
ldxppe        Load word from memory into an XPP event register                                             no
ldcm          Load word from memory into a CM register                                                     yes
stcm          Write word from a CM register into memory                                                    yes
cptoxppd      Copy a word from a LEON register into an XPP data register                                   no
cptoleond     Copy a word from an XPP register into a LEON data register                                   no
cptoxppe      Copy a word from a LEON register into an XPP event register                                  no
cptoleone     Copy a word from an XPP register into a LEON event register                                  no
cptocm        Copy a word from a LEON register into a CM register                                          yes
cptoleoncm    Copy a word from a CM register into a LEON register                                          yes
cptoleonsdi   Copy a word from the status register of an XPP data input register into a LEON register      no
cptoleonsdo   Copy a word from the status register of an XPP data output register into a LEON register     no
cptoleonsei   Copy a word from the status register of an XPP event input register into a LEON register     no
cptoleonseo   Copy a word from the status register of an XPP event output register into a LEON register    no
wrclkr        Write into a clock register to determine the clock ratio LEON-XPP                            yes
wroffsetr     Write into the memory offset register for memory mapped mode                                 yes
rdclkr        Read the clock register for the clock ratio LEON-XPP                                         yes
rdoffsetr     Read the memory offset register for memory mapped mode                                       yes
rdtrapr       Read the register with information about an XPP trap                                         yes

[0858] Data Transfer Between LEON and XPP TABLE US 00229

Opcode      op3      Operation
cptoxppd    101110   Copy a word from a LEON register into an XPP data register
cptoleond   101111   Copy a word from an XPP register into a LEON data register
cptoxppe    110010   Copy a word from a LEON register into an XPP event register
cptoleone   110011   Copy a word from an XPP register into a LEON event register

[0859] Format (3): TABLE US 00230

11   rd   op3   rs1   rxpp (opf)     rs2
31   29   24    18    13    12       4   0

20 

[0860] Assembler Syntax: TABLE US 00231

cptoxppd  reg.sub.rd, reg.sub.rxpp
cptoleond reg.sub.rxpp, reg.sub.rd
cptoxppe  reg.sub.rd, reg.sub.rxpp
cptoleone reg.sub.rxpp, reg.sub.rd

[0861] Description:

[0862] CPTOXPPD loads a word from r[rd] to the data register r[rxpp] of the XPP architecture.
[0863] CPTOLEOND loads a word from a data register r[rxpp] of the XPP architecture to r[rd].
[0864] CPTOXPPE loads a word from r[rd] to the event register r[rxpp] of the XPP architecture.
[0865] CPTOLEONE loads a word from the event register r[rxpp] of the XPP architecture to r[rd].

[0866] Traps:

[0867] xpp_readaccess_error
[0868] xpp_writeaccess_error
[0869] xpp_regnotexist_error

[0870] Data Transfer Between LEON and CM TABLE US 00232

Opcode       op3      Operation
cptocm       110110   Load word from LEON register into CM register
cptoleoncm   110111   Load word from CM register into LEON register

[0871] Format (3): TABLE US 00233

11   rd   op3   rs1   rcm (opf)      rs2
31   29   24    18    13    12       4   0

10 

[0872] Assembler Syntax: TABLE US 00234

cptocm     reg.sub.rd, reg.sub.rcm
cptoleoncm reg.sub.rcm, reg.sub.rd

[0873] Description:

[0874] CPTOCM loads a word of r[rd] into a register r[rcm] of CM.
[0875] CPTOLEONCM loads a word from register r[rcm] of CM to r[rd].

[0876] Traps:

[0877] privileged_instruction
[0878] cm_writeaccess_error
[0879] cm_regnotexist_error

[0880] Data Transfer Between XPP and Memory TABLE US 00235

Opcode   op3      Operation
stxppd   100010   Store word from an XPP data register into memory
ldxppd   100011   Load word from memory into an XPP data register
stxppe   100110   Store word from an XPP event register into memory
ldxppe   100111   Load word from memory into an XPP event register

[0881] Format (3): TABLE US 00236

op   rxpp (rd)   op3   rs1   i=0   asi      rs2
op   rxpp (rd)   op3   rs1   i=1   simm13
31   29          24    18    13    12       4   0



[0882] Assembler Syntax: TABLE US 00237

stxppd reg.sub.rxpp, [address]
ldxppd [address], reg.sub.rxpp
stxppe reg.sub.rxpp, [address]
ldxppe [address], reg.sub.rxpp

[0883] Description:

[0884] STXPPD/STXPPE writes a word from register rxpp into memory.
[0885] LDXPPD/LDXPPE loads a word from memory into register rxpp.
[0886] The effective address is calculated as "r[rs1]+r[rs2]" if i=0, otherwise as
"r[rs1]+simm13".

[0887] Traps:

[0888] xpp_readaccess_error
[0889] xpp_writeaccess_error
[0890] xpp_regnotexist_error
[0891] mem_address_not_aligned
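
The effective address rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of the LEON hardware: the register file is modeled as a plain list, addresses are assumed to be 32 bits wide, a word is assumed to be four bytes, and the alignment trap is modeled as a Python exception. The function name effective_address is hypothetical.

```python
def effective_address(regs, rs1, i, rs2=0, simm13=0):
    """Compute r[rs1]+r[rs2] if i == 0, otherwise r[rs1]+simm13 (simm13 is signed)."""
    if i == 0:
        addr = regs[rs1] + regs[rs2]
    else:
        addr = regs[rs1] + simm13
    addr &= 0xFFFFFFFF  # assumed 32-bit address space
    if addr % 4 != 0:
        # Assumed behavior: a word access to an unaligned address traps.
        raise ValueError("mem_address_not_aligned")
    return addr


regs = [0] * 32
regs[1] = 0x1000
regs[2] = 0x20
print(hex(effective_address(regs, rs1=1, i=0, rs2=2)))      # 0x1020
print(hex(effective_address(regs, rs1=1, i=1, simm13=-8)))  # 0xff8
```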

[0892] Data Transfer Between CM and Memory TABLE US 00238

Opcode   op3      Operation
ldcm     101010   Load word from memory into a CM register
stcm     101011   Write word from CM register into memory

[0893] Format (3): TABLE US 00239

op   rcm (rd)   op3   rs1   i=0   asi      rs2
op   rcm (rd)   op3   rs1   i=1   simm13
31   29         24    18    13    12       4   0
[0894] Assembler Syntax: TABLE US 00240

ldcm reg.sub.rcm, [address]
stcm [address], reg.sub.rcm

[0895] Description:

[0896] STCM writes a word from register rcm into memory.
[0897] LDCM loads a word from memory into register rcm.

[0898] The effective address is calculated as "r[rs1]+r[rs2]" if i=0, otherwise as
"r[rs1]+simm13".

[0899] Traps:

[0900] privileged_instruction
[0901] cm_readaccess_error
[0902] cm_writeaccess_error
[0903] cm_regnotexist_error
[0904] mem_address_not_aligned

[0905] Data Transfer from Status Registers to LEON TABLE US 00241

Opcode        op3      Operation
cptoleonsdi   101100   Copy a word from the status register of an XPP data input register into a LEON register
cptoleonsdo   101101   Copy a word from the status register of an XPP data output register into a LEON register
cptoleonsei   110000   Copy a word from the status register of an XPP event input register into a LEON register
cptoleonseo   110001   Copy a word from the status register of an XPP event output register into a LEON register

[0906] Format (3): TABLE US 00242

11   rd   op3   rs1   rst (opf)      rs2
31   29   24    18    13    12       4   0

[0907] Assembler Syntax: TABLE US 00243

cptoleonsdi reg.sub.rst, reg.sub.rd
cptoleonsdo reg.sub.rst, reg.sub.rd
cptoleonsei reg.sub.rst, reg.sub.rd
cptoleonseo reg.sub.rst, reg.sub.rd

[0908] Description:

[0909] CPTOLEONSDI loads a word from the status register r[rst] of a data input register into
the register r[rd] of the LEON processor.

[0910] CPTOLEONSDO loads a word from the status register r[rst] of a data output register
into the register r[rd] of the LEON processor.

[0911] CPTOLEONSEI loads a word from the status register r[rst] of an event input register
into the register r[rd] of the LEON processor.

[0912] CPTOLEONSEO loads a word from the status register r[rst] of an event output register
into the register r[rd] of the LEON processor.

[0913] Traps:

[0914] st_readaccess_error
[0915] st_regnotexist_error

[0916] Data Transfer Between XPP Configuration Register and LEON TABLE US 00244

Opcode      op3      Operation
wrclkr      111000   Write clock ratio LEON-XPP into clock register
wroffsetr   111001   Write into memory offset register for memory mapped mode
rdclkr      111010   Read clock register for clock ratio LEON-XPP
rdoffsetr   111011   Read memory offset register for memory mapped mode
rdtrapr     111110   Read registers with information about XPP trap

[0917] Format (3): TABLE US 00245

11   rd   op3   unused   unused      unused
31   29   24    18       13    12    4   0

[0918] Assembler Syntax: TABLE US 00246

wrclkr    r.sub.rd, %clkr
wroffsetr r.sub.rd, %memoffsetr
rdclkr    %clkr, r.sub.rd
rdoffsetr %memoffsetr, r.sub.rd
rdtrapr   %trapr, r.sub.rd

[0919] Description:

[0920] WRCLKR loads a word from the register r[rd] into the clock register. In case the register
contains the value 0, the XPP unit is deactivated, whereas any other value indicates the clock
ratio of the XPP unit to the LEON processor clock.

[0921] WROFFSETR loads a word from the register r[rd] into the memory offset register.
[0922] RDCLKR loads the content from the clock register into the register r[rd].
[0923] RDOFFSETR loads the content from the memory offset register into the register r[rd].
[0924] RDTRAPR loads the content of the trap information register into the register r[rd].

[0925] While at least a first embodiment of a coupling is disclosed in the text above, variations
are possible.

[0926] FIG. 57 shows another example of a preferred coupling between a conventional
(von-Neumann-like and/or sequential) processor and an array of processing elements
reconfigurable at runtime and/or on the fly, the figure referring to an XPP by way of example
only, although, as in all parts of the present invention, aspects of the disclosure might in some
cases be better understood by referring to publications that show and explain the functioning of
an XPP in more detail.



[0927] Here, a plurality of details is described in other parts of the present application, as will be
obvious from the similarity of the figures, yet some particular aspects showing preferred
implementations and/or embodiments and/or aspects can be found in more detail in FIG. 57.

[0928] Now, as for FIG. 57, attention is drawn to the following facts:

[0929] A coupling may use either one of two different paths. Both paths can be implemented as
an alternative, although in the preferred embodiment, these paths are implemented
simultaneously.

[0930] The first path, which transfers data between the ALU (or another part, particularly in
the data path) of the conventional processor and the XPP, is dps-like and is thus intended for
low-volume data transfer. As shown, it is possible to transfer data from the XPP array,
preferably via FIFOs and preferably via a MUX that selects either XPP event data or XPP result
data in response to a setting of the MUX, preferably made by either the processor or the XPP,
to one or a number of operand inputs of the ALU or of other units in the data path used for ALU
operand input, such as MUXes or the like. It is to be noted that a number of different kinds of
data can be transferred in that way, such as status information, flags and the like, as well as
arithmetic data. This transfer can be either from the ALU or from a unit downstream therefrom
in the datapath of the conventional processor. Also, data other than operand data, such as
events and/or information regarding internal status, can be transferred from the XPP to the
conventional processor to which it is coupled.
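
The selection between event data and result data on the first path can be illustrated with a minimal sketch. Everything named here is hypothetical: the class `OperandMux`, the select encodings `EVENT` and `RESULT`, the function `alu_add` and the 32-bit operand width are assumptions made for illustration, and the actual hardware of FIG. 57 may differ.

```python
from collections import deque

EVENT, RESULT = 0, 1  # hypothetical MUX select encodings


class OperandMux:
    """Selects which XPP output FIFO feeds an ALU operand input."""

    def __init__(self):
        self.event_fifo = deque()    # XPP event data (status bits, flags, ...)
        self.result_fifo = deque()   # XPP result data (arithmetic values)
        self.select = RESULT         # set by the processor or by the XPP

    def read_operand(self):
        """Return the next word from the selected FIFO, or None if it is empty."""
        fifo = self.event_fifo if self.select == EVENT else self.result_fifo
        return fifo.popleft() if fifo else None


def alu_add(a, b):
    """Stand-in ALU operation on 32-bit words (the width is an assumption)."""
    return (a + b) & 0xFFFFFFFF
```

A processor-side read with `select` set to `RESULT` returns arithmetic data that can be fed straight into `alu_add`, while switching `select` to `EVENT` delivers status or flag information through the same operand input.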


[0931] The second data path is to and/or from the cache, and it is to be noted that a coupling
may be effected to the D-cache and/or the I-cache. The coupling to the I-cache is advantageous
in that it allows for a very fast reconfiguration of the processing array, since only a minute
amount of data has to be handled within the sequential processor while large configuration data
remain possible. Here, the entire configuration need not be transferred through the ALU
or other conventional unit. Reconfiguration can rely on the conventional processor
sending configurations or, more preferably, configuration load instructions (e.g. the address of a
configuration or macro needed) to the array and/or to a configuration unit such as a configuration
manager coupled thereto, e.g. a FILMO, and/or can rely on the array itself requesting
reconfiguration, for example after the instantiation of a first configuration as part of a larger
macro that has been called as a subroutine or the like by the conventional processor. With
respect to the data coupling to the D-cache or other (large) memory units such as memory
banks, it is possible to allow for data streaming, e.g. using load/store configurations within the
array as have been described elsewhere. It is possible to implement various data
streaming units such as DMA, cache controllers dedicated to operate together with the array, and
the like. It is to be noted that within the data path for this coupling, no register need be present,
so that block move commands are easily implementable.

[0932] One of the advantages of the preferred coupling according to the invention, as described
in one aspect thereof, is that it is effected via the instruction pipeline of the conventional
processor design. The coupling does not rely on registers, does not need to handle every single
operand separately, and also allows for a decoupling of processor and array by the use of FIFOs,
the latter aspect being advantageous in that both devices may be operated asynchronously; that
is, it is not absolutely necessary in each and every case for one unit to wait until the other
has finished a certain task. Instead, it is sufficient to synchronize the two units by methods
such as interrupt routines and/or polling.
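
The decoupling by FIFOs with synchronization by polling can be sketched as a cycle-based simulation. This is a simplified illustration under stated assumptions, not a description of the actual device: the names `BoundedFifo` and `simulate`, the FIFO depth, the `array_period` parameter (the array acting only every n-th processor cycle) and the doubling operation standing in for the array's computation are all hypothetical.

```python
from collections import deque


class BoundedFifo:
    """Fixed-depth FIFO; push fails instead of blocking when the FIFO is full."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def full(self):
        return len(self.items) >= self.depth

    def empty(self):
        return not self.items

    def push(self, value):
        if self.full():
            return False
        self.items.append(value)
        return True

    def pop(self):
        return self.items.popleft()


def simulate(inputs, depth=2, array_period=3):
    """Processor and array exchange data only through two FIFOs."""
    to_array = BoundedFifo(depth)
    from_array = BoundedFifo(depth)
    pending = deque(inputs)
    results = []
    cycle = 0
    while len(results) < len(inputs):
        # Processor side: hand over an operand if there is room, never wait.
        if pending and not to_array.full():
            to_array.push(pending.popleft())
        # Array side: runs at its own, slower rate (assumed clock ratio).
        if cycle % array_period == 0 and not to_array.empty() and not from_array.full():
            from_array.push(to_array.pop() * 2)  # placeholder computation
        # Processor side: poll the result FIFO instead of stalling on the array.
        if not from_array.empty():
            results.append(from_array.pop())
        cycle += 1
    return results
```

Neither side ever waits for the other to finish a task: the processor checks `full()` and `empty()` each cycle and otherwise continues, which corresponds to the polling variant; an interrupt-driven variant would replace the result poll with a notification from the array.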

[0933] Also, the coupling shown is preferable over those known in the art since it allows for
coupling into both the data and the control flow.

[0934] With respect to other parts of the present application, it is noted that whereas this part
refers to FIFOs used in the data path to effect the data coupling, other parts, esp. those dealing
in more detail with certain compiler techniques, refer to the use of I-RAMs (internal RAMs) to
effect the decoupling. It will be obvious that a FIFO used in the XPP-data input path, XPP data
event input path and/or XPP config path might be replaced by an I-RAM, or that both I-RAMs
and FIFOs might be used simultaneously.

[0935] Where reference is being made to event data, it is to be noted that in simple cases these
will be single bit data, but that it is possible to use event vectors as well, that is, event data
having more than one bit.
ABSTRACT

The present invention relates to a method of coupling at least one (conventional) unit processing
data in a sequential manner, e.g. a CPU, von-Neumann processor and/or microcontroller, the
(conventional) unit for data processing comprising an instruction pipeline, and an array for
processing data comprising a plurality of data processing cells, e.g. a preferably coarse grain
and/or preferably runtime reconfigurable data processor, FPGA, DFP, DSP, XPP or
chameleon-technology-like data processing fabric, wherein the array is coupled to the
instruction pipeline.